Abstract


Integration  of  Physical  Design  and  Sequential  Optimization 

by 

Philip  Chong 

Doctor  of  Philosophy  in  Engineering  —  Electrical  Engineering  and  Computer  Sciences 

University  of  California,  Berkeley 
Professor  Robert  K.  Brayton,  Chair 

This work examines the interaction between the physical design of digital integrated circuits and
sequential optimization techniques used for performance enhancement. In particular, the integration
of floorplanning and placement with retiming and clock skew scheduling is explored. A theoretical
result is given which addresses the computational complexity of circuit partitioning under
constraints derived from sequential optimization; this motivates the need for heuristic approaches to
the related placement problem. Another theoretical result provides a characterization of the feasible
retimings of a sequential circuit; this result is used to motivate an effective method for floorplanning
integrated with sequential optimization. Practical techniques for using sequential slack to drive
standard-cell placement are shown here; experiments demonstrate significant improvement in
final design performance using these methods. Another part of this work examines how the role of
sequential optimization and physical design changes when the design allows for asynchronous or
latency-insensitive communication between modules. A theoretical result relating to the problem
of clock tree implementation for clock skew scheduling under process variation is given. Finally, an
experimental technique for floorplanning using nonlinear programming is demonstrated.


Professor  Robert  K.  Brayton 
Dissertation  Committee  Chair 


To  Mom  and  Dad 
And  All  My  Friends 
Chapter 1

Introduction 


1.1  Motivation 

The progress of semiconductor fabrication technology is changing the impact of interconnect
delay on the performance of digital integrated circuits. In the past, such delays were often
ignored during the design process, as they were considered negligible compared to the delays of the
gates in the circuit. However, today interconnect delay forms a substantial portion of the total circuit
delay. [SK99] suggests between 24 and 36 percent of total circuit delay comes from interconnects
in fabrication processes typically in use today.

Looking beyond today's processes shows an expected trend of increasing impact of interconnect
delay. Table 1.1 shows the predicted evolution of interconnect delay based on projections
from [ITR03]. The first row indicates the year of projection. The next three rows indicate the
anticipated RC delay associated with 1mm lengths of Metal 1 wire, intermediate-length wire and global
wire, respectively.

Of course, looking only at delay for a 1mm length of wire can be misleading, as this does
not account for any change in gate sizing. As gate sizes shrink, a fixed length of wire will span an
increasing number of gates and thus represent a more substantial interconnect. Put another way, if
a design is implemented in a smaller technology, wire lengths will shrink accordingly.

To put this in proper context, we use the personal digital assistant system-on-a-chip (PDA
SOC) design driver (also presented in [ITR03]) to normalize these figures. The PDA represents a
typical design one might wish to implement on an ASIC. The projected clock frequency and process
technology (feature size) for this application are indicated in the table. The last three rows of Table 1.1
show the product of the RC delay constant (taken as ps/mm) for each of the three wire types with
the feature size (in nm). Taking this product effectively normalizes the RC delay values in terms
of the feature size. Of course this is not exact, as wire delay is not exactly a linear function of
wire length. However, such a normalization is useful to roughly account for the expected decrease
in feature size. Here we see that this normalized delay is expected to increase roughly threefold,
looking forward to 2018.

Year                                2003     2006     2009     2012     2015     2018
RC, 1mm Metal 1 (ps)                 191      355      595      963     1510     2679
RC, 1mm Intermed. Wire (ps)          105      224      358      552      908     1582
RC, 1mm Global Wire (ps)              42       87      139      220      354      618
PDA SOC Design Driver
  Clock Freq. (MHz)                  300      450      600      900     1200     1500
  Process Tech. (nm)                 101       90       65       45       32       22
RC-Size Product, Metal 1          0.0193   0.0320   0.0387   0.0433   0.0483   0.0589
RC-Size Product, Intermed. Wire   0.0106   0.0202   0.0233   0.0248   0.0291   0.0348
RC-Size Product, Global Wire     0.00424  0.00783  0.00904  0.00990   0.0113   0.0136

Table 1.1: ITRS interconnect technology projections. From [ITR03].
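The normalization used for the last three rows of Table 1.1 can be checked with a few lines of arithmetic. The following is a minimal sketch for the Metal 1 row only; the variable and function names are illustrative, and the factor of 1e-6 (nm to mm) is an assumption inferred from the magnitude of the tabulated products.

```python
# Values copied from the Metal 1 and Process Tech. rows of Table 1.1.
years = [2003, 2006, 2009, 2012, 2015, 2018]
rc_metal1_ps_per_mm = [191, 355, 595, 963, 1510, 2679]
feature_nm = [101, 90, 65, 45, 32, 22]


def rc_size_product(rc_ps_per_mm, size_nm):
    """RC delay constant (ps/mm) times feature size (nm), with nm converted to mm.

    Assumes delay is linear in wire length, which the text notes is only approximate.
    """
    return rc_ps_per_mm * size_nm * 1e-6


products = [rc_size_product(rc, nm) for rc, nm in zip(rc_metal1_ps_per_mm, feature_nm)]
growth = products[-1] / products[0]  # ratio of the 2018 value to the 2003 value

for year, p in zip(years, products):
    print(year, round(p, 4))
print("growth 2003 -> 2018:", round(growth, 2))
```

Under these assumptions the ratio of the last to the first product is a little over 3, which matches the "roughly threefold" increase stated in the text.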

On top of the purely physical effects of process scaling, designs are also expected
to run at increasingly faster rates (see the clock frequency row in Table 1.1). Furthermore, designs
are growing larger and more complex as consumers demand more features in the products which
contain these ICs. The combined effect of all these trends is to make it increasingly difficult for
designers to achieve the necessary circuit performance.

This work focuses on two powerful techniques used by digital circuit designers for performance
optimization. Retiming [LS83, LS91] and clock skew scheduling [Fis90, DS95] are methods
which can be used to improve design performance. We call these sequential optimization
techniques, as they require changing the nature of the sequential elements within the targeted design.
Retiming involves changing the structural location of the sequential elements, while clock skew
scheduling involves changing the relative clock skews between the sequential elements.

Of course, performance optimization requires accurate modeling and prediction of interconnect
delays in the design. Since such delays are dependent on the length of the wires, this
means that the problem of sequential optimization and the physical implementation of the design
are closely interrelated. This indicates that an integration of physical design and sequential
optimization is necessary to achieve good results with these techniques.

Our work focuses on the floorplanning and placement aspects of the physical design problem.
Of course, other aspects of physical design, such as routing and clock tree implementation,
can affect sequential optimization greatly. However, these are not the focus of this work. This thesis
explores how sequential optimization can be integrated with floorplanning and placement tools, and
presents new techniques for performance optimization of digital synchronous circuits using these
ideas.


1.2  Existing  Work 

Placement  and  floorplanning  are  processes  for  determining  non-overlapping  locations  for 
circuit  elements  on  a  silicon  die  while  minimizing  a  given  cost  function.  Placement  is  distinguished 
from  floorplanning  in  that  placement  is  the  term  generally  used  with  objects  at  a  fine-grained  level 
(e.g.  standard  cells  each  representing  a  single  gate  of  logic),  while  floorplanning  involves  much 
larger  objects  (e.g.  macroblocks  composed  of  hundreds  of  thousands  of  gates).  A  typical  design 
flow  might  utilize  both  floorplanning  and  placement  techniques,  especially  if  the  design  is  large  and 
has  been  designed  in  a  hierarchical  fashion,  or  if  the  design  is  built  from  pre-existing  macroblocks. 

1.2.1  Placement 

Placement is typically performed in two stages. The first, called global placement, concerns
finding general locations for the standard cells on the overall die. The result of global
placement may still contain overlaps between cells, but the cells are spread uniformly enough
that the next step can be performed at low computational cost. This second step, detailed placement
or legalization, transforms the nearly-legal result of global placement into a final legal placement
with no cell overlaps. Detailed placement generally focuses on optimizations localized
to small areas of the die [DJS91, KMR04], or on minimizing the total perturbation
with respect to the global placement [BV04]. Therefore in this work we focus on global
placement, as the greatest optimization potential lies in this step.

Placement has been well studied in the past for several interesting cost functions. The
most common cost function used in the literature is wirelength. Although wirelength by itself is not
a very useful metric, it has the advantage of being relatively easy to optimize, and it does represent
a coarse measure of routing congestion [WS00]. Wirelength has also been observed to be roughly
correlated with the timing of the design [CKM+99a].

Nearly all modern global placement techniques fall into two categories. The first category
contains what are known as analytic placers. These formulate the placement problem abstractly as a
quadratic program as follows: Let $G = (V,E)$ be a graph representing the circuit, where the vertices
represent the standard cells to be placed and the edges represent the interconnections between the
cells. The goal is then to find locations $(x(v), y(v)) \in \mathbb{R}^2$ for all vertices $v \in V$ to minimize

\[ \sum_{(u,v) \in E} \left[ (x(u) - x(v))^2 + (y(u) - y(v))^2 \right] \]

The cost function serves as a rough estimate for the wirelength. As the cell locations collapse onto
a single point if all vertices are unconstrained, additional vertices with fixed locations are added to
the problem. These extra vertices are typically taken to be I/O pins around the die boundary so that
the movable cells will lie within the convex hull of the fixed pins.
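The quadratic program above can be illustrated on a toy netlist. The following is a minimal sketch, not the method of any particular placer: it assumes two-pin edges, and it uses plain Gauss-Seidel sweeps (each movable cell is moved to the average of its neighbors, which is where the gradient of the cost vanishes for that cell). The function name `quadratic_place` and the data layout are hypothetical.

```python
def quadratic_place(edges, fixed, movable, sweeps=200):
    """Approximately minimize the sum over edges of squared Euclidean distance.

    edges:   list of (u, v) pairs (self-loops are not supported)
    fixed:   dict vertex -> (x, y) for fixed pins
    movable: list of movable vertices
    """
    pos = dict(fixed)
    for v in movable:
        pos[v] = (0.0, 0.0)  # arbitrary starting point
    nbrs = {v: [] for v in movable}
    for u, v in edges:
        if u in nbrs:
            nbrs[u].append(v)
        if v in nbrs:
            nbrs[v].append(u)
    for _ in range(sweeps):
        for v in movable:
            if nbrs[v]:
                xs = [pos[n][0] for n in nbrs[v]]
                ys = [pos[n][1] for n in nbrs[v]]
                # Stationarity condition of the quadratic cost for vertex v.
                pos[v] = (sum(xs) / len(xs), sum(ys) / len(ys))
    return pos


# A chain p0 - a - b - p1 with pins fixed at x = 0 and x = 3.
placement = quadratic_place(
    edges=[("p0", "a"), ("a", "b"), ("b", "p1")],
    fixed={"p0": (0.0, 0.0), "p1": (3.0, 0.0)},
    movable=["a", "b"],
)
print(placement["a"], placement["b"])
```

In this example the two movable cells settle evenly between the pins. Without the fixed pins every cell would collapse onto a single point, which is the degenerate solution the text describes.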

The main difficulty with the analytic approach is that the placement result tends to be
clustered in the center of the die and is insufficiently spread out for the subsequent detailed placement
step. This has resulted in a focus in the literature on techniques for spreading the cells.
Commonly this is done iteratively, either through the addition of “spreading forces” or extra fixed points
[EJ98, HMS02, VC04] or using graph partitioning techniques to subdivide the die into disjoint areas
and force spreading by assigning cells to these areas [TKH88, TK91a, KSJ88].

The partitioning-based cell spreading techniques eventually developed into the second
category of global placement techniques, those known as partitioning-based placers. These forgo
the use of the quadratic program altogether, and instead recursively partition the
netlist graph and assign the partitions to disjoint die areas to enforce adequate cell spreading.
Typically some form of the mincut partitioning problem is used, which is as follows: Find a partition
$(A,B)$ of the vertices $V$ such that $|A| = |B|$ and the size of the set of cut nets

\[ |\{(u,v) \in E : (u \in A \land v \in B) \lor (u \in B \land v \in A)\}| \]

is minimized. Usually a variant of the Fiduccia-Mattheyses heuristic is used to solve the graph
partitioning problem [FM82, KAKS97, CKM99b]. Partitioning-based placers have shown reasonably
good correlation between the use of the mincut partitioning heuristic and the wirelength of the final
result [A+99, CKM00].
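The mincut objective above is easy to state in code. The sketch below counts cut edges for a given partition and, for tiny graphs only, finds an optimal bisection by exhaustive search; it is an illustration of the objective, not of the Fiduccia-Mattheyses heuristic used in practice. The function names are hypothetical, and an even number of vertices is assumed so that $|A| = |B|$ is possible.

```python
from itertools import combinations


def cut_size(edges, part_a):
    """Number of edges with exactly one endpoint in part_a."""
    return sum(1 for u, v in edges if (u in part_a) != (v in part_a))


def brute_force_bisection(vertices, edges):
    """Exhaustively search all equal-sized partitions (exponential time)."""
    half = len(vertices) // 2
    best = min(combinations(vertices, half), key=lambda a: cut_size(edges, set(a)))
    return set(best), cut_size(edges, set(best))


# Two triangles joined by a single edge (2, 3).
verts = [0, 1, 2, 3, 4, 5]
nets = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
side_a, cut = brute_force_bisection(verts, nets)
print(side_a, cut)
```

The search separates the two triangles and cuts only the bridging edge, which matches the intuition that mincut partitioning groups tightly connected cells together.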

1.2.2  Floorplanning 

Floorplanning is chiefly distinguished from placement by the need to deal with large
macroblocks. Unlike standard cell gates, which for a particular design are all chosen from a library
where all cells have uniform height, macroblocks can vary significantly in size, from thousands to
hundreds of thousands of gates. Moreover, floorplanning tools must account for soft macroblocks,
which represent a subdesign of known logical structure, but without layout information. Such soft
blocks will have a fixed area, but their aspect ratio may vary.

The literature has mainly focused on floorplanning as an optimization problem where the
cost function is a linear weighted sum of the total die area (i.e. the sum of the macroblock areas
plus any wasted space due to packing inefficiencies) and the total wirelength of the buses connecting
the macroblocks.

Numerous techniques have been presented in the literature for floorplanning. The oldest
of these use slicing tree structures to represent the relative positions of the macroblocks [Bre77,
Ott82, WL86]. A slicing tree is a binary tree graph where the leaves represent the modules to be
placed and the internal nodes represent horizontal and vertical cutlines at the appropriate level of
the hierarchy. Some newer approaches use non-slicing structures of various types to represent the
relative positions [MFNK96, NFMK96, GCY99, H+00].

The key feature of these floorplan representations is that they represent the layout of
the macroblocks in a compact, easily manipulated form. For all these floorplan representations,
efficient algorithms exist to convert the compact representation into actual locations and aspect
ratios for the macroblocks. Thus the search space of feasible floorplans is reduced. In the literature,
the optimization technique typically used is simulated annealing based on manipulations of the
underlying floorplan representation.

1.2.3  Sequential  Optimization 

We consider two techniques for sequential optimization. The first is retiming, which is
based on the observation that replacing registers located at the output of a gate with registers located
at the input of the gate does not change the functionality of the circuit [LS83, LS91]. By applying
such register movement operations, various optimizations can be achieved, such as minimizing the
total number of registers or minimizing the clock period of the design.
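The register movement described above can be sketched with the usual lag-function formulation of retiming, which is assumed here rather than taken from this section: each gate $v$ is given an integer lag $r(v)$, and a connection $u \to v$ carrying $w$ registers carries $w + r(v) - r(u)$ registers after retiming. The function name `retime` and the edge-list layout are hypothetical.

```python
def retime(edges, lag):
    """Apply a retiming given as integer lags on the vertices.

    edges: list of (u, v, w), where w is the register count on connection u -> v
    lag:   dict vertex -> integer lag r(vertex)
    Returns the retimed edge list, or None if some edge would need a negative
    number of registers (the lags are then not a legal retiming).
    """
    retimed = []
    for u, v, w in edges:
        w_r = w + lag[v] - lag[u]
        if w_r < 0:
            return None
        retimed.append((u, v, w_r))
    return retimed


# Gate g with no register on its input and one register on its output.
circuit = [("a", "g", 0), ("g", "b", 1)]

# A lag of +1 on g moves the register from the output of g to its input.
moved = retime(circuit, {"a": 0, "g": 1, "b": 0})
print(moved)

# A lag of -1 on g would need a register on the input that is not there.
illegal = retime(circuit, {"a": 0, "g": -1, "b": 0})
print(illegal)
```

Because the lags cancel around any cycle, the number of registers on every cycle is unchanged under this formulation.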

Clock skew scheduling is the second technique for sequential optimization which we
consider. This involves adjusting the delay (skew) of the clock signals to the individual registers of the
design, in order to improve the overall circuit performance [Fis90, DS95]. [Fis90] observes that
clock skew scheduling can be considered equivalent to retiming; instead of physically moving
registers across gates, clock skew scheduling moves registers virtually by delaying their clock signal
appropriately.

There have been some efforts to integrate retiming and placement. [CL00, Lim00]
address the problem of “physical planning” (partitioning in the context of geometric layout) under
the freedom to perform retiming on the final design. There the notion of sequential slack is
introduced, and an iterative net weighting scheme is used during partitioning. [YMS03] avoids iterative
techniques by using slack budgeting. [SB02] demonstrates an approach to field-programmable gate
array design which accounts for retiming. These techniques have various difficulties associated with
them, and we contrast these with our work in Chapter 4.

1.3  Outline 

Here we outline the structure of the subsequent chapters of this thesis and highlight the
contributions presented therein.

Chapter 2 shows the NP-completeness of a simple partitioning problem under sequential
timing constraints. This result provides theoretical justification for the heuristic approach taken
through the rest of the thesis. That is, this proof motivates the need to attack sequentially-based
physical design problems with heuristic techniques, rather than seek exact optimal solutions. This
key motivation was not provided in any of the existing works in this area.

Chapter 3 presents a novel proof for a theorem which characterizes all feasible retimings
of a circuit. While an equivalent theorem was previously proven in [SSBSV92], our proof is
significantly different, and provides a practical constructive technique for transforming a given
retiming into any retiming compatible with the given one. This chapter also presents a retiming-aware
floorplanning application which takes advantage of this theorem.

Chapter 4 addresses some shortcomings of the technique given in Chapter 3. The concept
of sequential slack is presented, and sequential timing-aware placement techniques using heuristics
based on this metric are given.

Chapter 5 extends our ideas to the domain of asynchronously-communicating systems.
Here, adding latency to communication paths has no effect on the correctness of a design (unlike
with synchronous systems, where excessive latency leads to incorrect results). However, latency
does affect performance, resulting in a different optimization problem. We present a technique
for quickly estimating the performance impact of a given assignment of latencies along
interconnect, and use this to create a performance-driven floorplanning algorithm for
asynchronously-communicating systems.
Chapter 6 discusses the problem of realization of clock trees for synchronous digital
circuits. A sufficient condition for the solution to the optimal clock scheduling problem in the face of
process variations is given.

Chapter 7 presents a new experimental technique for floorplanning using a general-purpose
nonlinear programming package. While the results are inconclusive, there is still hope that the
general optimization framework of nonlinear programming can allow more sophisticated modeling
of the flexibility introduced by sequential optimization.

Chapter  8  summarizes  the  contributions  presented  in  this  thesis. 


Chapter  2 


On  The  Complexity  Of  Partitioning 
Under  Sequential  Timing  Constraints 

2.1  Introduction 

Graph partitioning techniques form a critical core for many placement algorithms
in use today. Some algorithms rely on partitioning alone [CKM00, Lim00, A+99]; others use
partitioning as a subproblem [KSJA91, TKH88]. In all cases, partitioning is used to subdivide large
intractable problems into smaller ones which can be more readily solved.

Partitioning is a graph-theoretic problem. In a circuit context, circuit elements (e.g. gates)
are represented by vertices of the graph, and interconnections (e.g. wires) between the elements are
represented by edges. Traditionally, partitioning has been applied in placement with minimization
of the number of cut edges in the graph as the cost function. Min-cut partitioning has been found
to be an effective heuristic for placement where the minimization of total wire length is the primary
metric used to evaluate the final placed design [CKM00, A+99]. Intuitively, this is due to how
min-cut partitioning separates a network into localized subnetworks. Interconnects which are cut by
the partition tend to become long global wires in the final design, whereas interconnects contained
within a single subpartition tend to be shorter local wires. Thus there is generally a good correlation
between the number of cut edges during partitioning and the total wire length in the final placement.

In practice, considerations other than wirelength must be taken into account. Most notably,
performance of the final placed design is often a critical concern. Historically, much of the research
in performance-driven partitioning has been focused on combinational-timing driven placement.
However, recent work has dealt with partitioning for performance in a sequential setting, allowing
for retiming and clock skew scheduling to take place [Lim00, PKL98].

A common approach for performance-driven partitioning is to use net weighting together
with the usual min-cut partitioning techniques, in order to prevent critical nets whose delays have a
large impact on performance from being cut by the partition. This heuristic net weighting technique
was developed for combinational timing-driven partitioning, but has been adopted by researchers
looking at sequential-timing driven partitioning as well [Lim00, PKL98]. However, the complexity
of partitioning under sequential flexibility was previously not made clear, and the justification for
adopting such a heuristic technique was not based on theoretical grounds.

In this chapter, we present a proof for the NP-completeness of partitioning under a
sequential performance metric. This shows that the adoption of such heuristic techniques is
well-justified.

2.1.1  Definitions 

In the following, let $G = (V,E)$ be a finite directed graph.

Definition 2.1 (Partition). A partition $(A,B)$ of the vertices $V$ is a pair of subsets $A \subseteq V$, $B \subseteq V$
such that $A \cup B = V$ and $A \cap B = \emptyset$.

Definition 2.2 (Cut Edge). Edge $(u,v) \in E$ is cut by partition $(A,B)$ if either $u \in A$ and $v \in B$, or
$u \in B$ and $v \in A$.

Definition 2.3 (Cycle). A cycle $\ell$ in $G$ is a nonempty ordered list of edges

\[ \ell = ((u_1,v_1), (u_2,v_2), \ldots, (u_{|\ell|},v_{|\ell|})), \quad (u_i,v_i) \in E,\ i \in \{1, \ldots, |\ell|\} \]

such that $v_i = u_{i+1}$ for $i \in \{1, \ldots, |\ell| - 1\}$, $v_{|\ell|} = u_1$, and all the $u_i$ are unique.

For simplicity, here the term cycle is used to refer to simple cycles (with unique vertices)
only. Define $C(G)$ to be the set of all cycles in $G$.

Definition  2.4  (n-Cycle).  An  n-cycle  is  a  cycle  with  n  edges  (equivalently,  n  vertices). 

Let $d : E \to \mathbb{R}$ be a labeling of the edges of $G$ with real numbers; $d$ represents the delay
for signals which travel across the edges.

Definition 2.5 (Maximum Mean Cycle). The maximum mean cycle (MMC) of graph $G$ under
labeling $d$ is

\[ \mathrm{MMC}(G,d) = \max_{\ell \in C(G)} \frac{\sum_{(u,v) \in \ell} d(u,v)}{|\ell|} \]

Several algorithms are known for the efficient computation of the MMC in polynomial
time [DIG98, Kar78].
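As an illustration of Definition 2.5, the following is a minimal sketch of one polynomial-time approach, assuming Karp's dynamic-programming recurrence [Kar78] applied to negated delays (the minimum mean cycle of $-d$ is the negation of the maximum mean cycle of $d$). The function name `max_mean_cycle`, the integer vertex numbering and the edge-list layout are hypothetical choices made for this example.

```python
import math


def max_mean_cycle(n, edges):
    """Maximum mean cycle of a directed graph with vertices 0..n-1.

    edges: list of (u, v, delay). Returns None if the graph has no cycle.
    best[k][v] is the minimum total negated delay over walks with exactly
    k edges ending at v, starting from any vertex.
    """
    inf = math.inf
    best = [[inf] * n for _ in range(n + 1)]
    best[0] = [0.0] * n
    for k in range(1, n + 1):
        for u, v, delay in edges:
            if best[k - 1][u] < inf:
                best[k][v] = min(best[k][v], best[k - 1][u] - delay)
    min_mean = inf
    for v in range(n):
        if best[n][v] == inf:
            continue  # no n-edge walk ends at v
        worst = max(
            (best[n][v] - best[k][v]) / (n - k)
            for k in range(n)
            if best[k][v] < inf
        )
        min_mean = min(min_mean, worst)
    return None if min_mean == inf else -min_mean


# A 3-cycle with delays 1, 2, 3 (mean 2).
triangle = [(0, 1, 1.0), (1, 2, 2.0), (2, 0, 3.0)]
print(max_mean_cycle(3, triangle))

# A 2-cycle with delays 1 and 4 (mean 2.5).
pair = [(0, 1, 1.0), (1, 0, 4.0)]
print(max_mean_cycle(2, pair))
```

The table has $(n+1) \times n$ entries and each row is filled by one pass over the edges, so the running time of this sketch is $O(|V||E|)$.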


2.2  Maximum  Mean  Cycle  Partitioning 

Consider the following model of a gate-level network: let $G = (V,E)$ be a directed graph
where $V$ represents the registers of the network, and $E$ the paths connecting these registers (possibly
through combinational gates). The delay labeling $d$ then represents the maximum combinational
delay between the registers of the design.

Under this model, the maximum mean cycle $\mathrm{MMC}(G,d)$ represents the minimum clock
period which can be achieved using clock skew scheduling techniques [Fis90, SBSV92, Szy92,
DS95]. That is, the MMC becomes the performance metric to be minimized for the design.

The impact of partitioning on the MMC is modeled here in a simple fashion. A fixed
extra delay is added to the delay of every edge which is cut by the partition, while the delays of
the remaining edges are not changed. This is somewhat analogous to the use of the min-cut metric
for partitioning for wirelength; intuitively, the cut edges become global interconnects, and thus will
incur extra delay, whereas uncut edges will tend to be more localized and thus faster.


2.2.1  Problem  Definition 


Formally, we define the maximum mean cycle partitioning (MMCP) problem as follows:
Input: A directed graph $G = (V,E)$ with edge delays $d : E \to \mathbb{R}$, bin size $k \in \mathbb{Z}$, and cut
set delay $\delta \in \mathbb{R}$, $\delta \ne 0$.

Output: A partition $(A,B)$ of $V$, where $|A| \le k$, $|B| \le k$, which minimizes $\mathrm{MMC}(G,d')$,
where $d' : E \to \mathbb{R}$ is defined as

\[ d'(e) = \begin{cases} d(e) + \delta & \text{if } e \text{ is cut by } (A,B) \\ d(e) & \text{otherwise} \end{cases} \]


The equivalent decision problem is:
Input: A directed graph $G = (V,E)$ with edge delays $d : E \to \mathbb{R}$, bin size $k \in \mathbb{Z}$, cut set
delay $\delta \in \mathbb{R}$, $\delta \ne 0$, and target $T \in \mathbb{R}$.

Output: 1 if there exists a partition $(A,B)$ of the vertices where $|A| \le k$, $|B| \le k$, such that
$\mathrm{MMC}(G,d') \le T$ with $d'$ defined as above; otherwise 0.
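The optimization form of MMCP can be made concrete with an exhaustive solver for very small instances. This is a minimal sketch for illustration only: it enumerates every partition satisfying the bin-size bound and every simple cycle, so it takes exponential time. The function names and the edge-list layout are hypothetical, and the size constraints are read as $|A| \le k$, $|B| \le k$.

```python
from itertools import combinations


def simple_cycles(n, edges):
    """Yield each simple cycle once, as a list of edge indices.

    Each cycle is reported from its smallest vertex to avoid duplicates.
    """
    out = {u: [] for u in range(n)}
    for idx, (u, v, _) in enumerate(edges):
        out[u].append((v, idx))

    def dfs(start, u, path, seen):
        for v, idx in out[u]:
            if v == start:
                yield path + [idx]
            elif v > start and v not in seen:
                yield from dfs(start, v, path + [idx], seen | {v})

    for s in range(n):
        yield from dfs(s, s, [], {s})


def mmc(n, edges, delay):
    """Maximum mean cycle under per-edge delays; None if the graph is acyclic."""
    means = [sum(delay[i] for i in cyc) / len(cyc) for cyc in simple_cycles(n, edges)]
    return max(means) if means else None


def solve_mmcp(n, edges, k, delta):
    """Return (best MMC, part A) over all partitions with |A| <= k and |B| <= k."""
    best = None
    for size in range(max(0, n - k), min(k, n) + 1):
        for part in combinations(range(n), size):
            a = set(part)
            # d'(e) = d(e) + delta on cut edges, d(e) otherwise.
            delay = [d + delta if (u in a) != (v in a) else d for u, v, d in edges]
            value = mmc(n, edges, delay)
            if value is not None and (best is None or value < best[0]):
                best = (value, a)
    return best


# A 4-cycle 0 -> 1 -> 2 -> 3 -> 0 with unit delays, plus a slow back edge 1 -> 0.
graph = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (3, 0, 1.0), (1, 0, 3.0)]
result = solve_mmcp(4, graph, k=2, delta=1.0)
print(result)
```

In this example the best balanced partition keeps vertices 0 and 1 together, so the slow 2-cycle between them is not penalized by the cut set delay.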

From a practical perspective, using negative values for $\delta$ may not make sense, because,
as noted, the cut edges are generally physically interpreted as corresponding to longer global wires
which incur more delay, not less. However, from a theoretical standpoint, while the complexity
proof for the case where $\delta < 0$ takes the same general approach as the case where $\delta > 0$, there
are notable differences in the details. Therefore, for completeness, we present both cases in the
following section. As well, the case $\delta < 0$ is somewhat simpler than the case $\delta > 0$, so studying
the proof for the former may help in understanding the proof for the latter.

2.3  Theoretical  Results 

2.3.1  MMCP  With  Negative  Cut  Set  Delay  Is  NP-Hard 
Theorem 2.6. For $\delta < 0$, MMCP is NP-hard.

Proof of Theorem 2.6. NP-hardness can be shown by reduction from 3SAT. The 3SAT problem is
[PS98]:

Input: A set

\[ X = \{x_1, \ldots, x_{|X|}\} \]

of boolean variables and a set

\[ W = \{w_1, \ldots, w_{|W|}\} \]

of boolean clauses, such that

\[ w_i = (y_{i1} \lor y_{i2} \lor y_{i3}), \quad i \in \{1, \ldots, |W|\} \]

where each $y_{ij}$ is either a boolean literal $x_{k_{ij}} \in X$ or its negation $\bar{x}_{k_{ij}}$, and $\{k_{i1}, k_{i2}, k_{i3}\}$ are distinct for
any given $i$.

Output: 1 if there exists an assignment of the binary values $\{0,1\}$ to the variables of $X$
such that the boolean expression

\[ w_1 \land \ldots \land w_{|W|} \]

evaluates to 1; otherwise 0.
Given an instance of 3SAT, an equivalent MMCP instance with $\delta < 0$ can be constructed
as follows. Here vertices of the graph are identified with tuples of set elements to facilitate
identification of the MMCP graph vertices with elements from the original 3SAT instance.

Let $X_P = X \cup \{1\}$; 1 represents the constant boolean value 1. Let $X_N = \{\bar{x} : x \in X_P\}$ be
the set of literals generated by taking the negations of the elements of $X_P$. The negation of 1 is
the constant boolean value 0. Assume WLOG that 0 and 1 are distinct from all other elements of
$X_P \cup X_N$. For the directed graph $G = (V,E)$, take $V = (X_P \cup X_N) \times W$. That is, there is a vertex in $V$
associated with every combination of literal (either positive or negative) and clause from the 3SAT
instance.

Create a set of edges

\[ E^1 = \{(u,v) : u = (x_k, w_i) \in V,\ v = (\bar{x}_k, w_j) \in V\} \]

That is, $E^1$ consists of all edges created by connecting each vertex $u$ to all vertices $v$ where the literal
associated with $v$ is the negation of the literal associated with $u$.

Create a second set of edges $E^2$ where

\[ E^2 = E^2_1 \cup \ldots \cup E^2_{|W|} \]

where for all $i \in \{1, \ldots, |W|\}$

\[ E^2_i = ((r_{i1},r_{i2}), (r_{i2},r_{i3}), (r_{i3},r_{i4}), (r_{i4},r_{i1})) \]
\[ r_{i1} = (y_{i1}, w_i) \in V \]
\[ r_{i2} = (y_{i2}, w_i) \in V \]
\[ r_{i3} = (y_{i3}, w_i) \in V \]
\[ r_{i4} = (0, w_i) \in V \]

That is, each $E^2_i$ consists of a single 4-cycle, each of which is associated with the clause $w_i$; the
vertices are chosen to be among those which correspond to the 3 literals which appear in clause $w_i$,
along with a vertex corresponding to the literal 0. The order in which the vertices appear in the cycle is not
relevant here; simply take them in order of their appearance in the given clause. $E^2$ is then the union
of all $E^2_i$.

Now take $E = E^1 \cup E^2$. Take $d(e) = 1$ for all $e \in E$, take $k = |V|/2$, take $\delta = -1$, and take
$T = 1 - 1/|V|$. This is now an instance of the MMCP decision problem.
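The graph construction above can be sketched in code. This is a minimal sketch under stated assumptions, not a definitive implementation: literals are encoded as hypothetical `(name, polarity)` pairs, the constants 1 and 0 are encoded as `("1", True)` and `("1", False)` (so no variable may be named `"1"`), and $E^1$ is assumed to connect a vertex to the negated literal in every clause, not only its own. The 3SAT instance used is an invented example and is not the one from Figure 2.1.

```python
from itertools import product

ONE = ("1", True)    # the constant literal 1
ZERO = ("1", False)  # its negation, the constant 0


def negate(lit):
    name, positive = lit
    return (name, not positive)


def build_mmcp_graph(variables, clauses):
    """variables: list of names; clauses: list of 3-tuples of (name, polarity)."""
    positives = [(x, True) for x in variables] + [ONE]
    literals = positives + [negate(lit) for lit in positives]
    idx = range(len(clauses))
    vertices = [(lit, i) for lit in literals for i in idx]
    # E1: each vertex points to every vertex carrying the negated literal.
    e1 = [((lit, i), (negate(lit), j)) for lit in literals for i, j in product(idx, idx)]
    # E2: one 4-cycle per clause through its three literals and the constant 0.
    e2 = []
    for i, (y1, y2, y3) in enumerate(clauses):
        r = [(y1, i), (y2, i), (y3, i), (ZERO, i)]
        e2 += [(r[t], r[(t + 1) % 4]) for t in range(4)]
    return vertices, e1, e2


def partition_from_assignment(vertices, assignment):
    """Part A: vertices whose literal evaluates to 1 under the assignment."""
    def value(lit):
        name, positive = lit
        return positive if name == "1" else assignment[name] == positive
    return {v for v in vertices if value(v[0])}


# Hypothetical instance: (x1 or x2 or x3) and (not x1 or x2 or not x3).
clauses = [
    (("x1", True), ("x2", True), ("x3", True)),
    (("x1", False), ("x2", True), ("x3", False)),
]
V, E1, E2 = build_mmcp_graph(["x1", "x2", "x3"], clauses)
A = partition_from_assignment(V, {"x1": True, "x2": True, "x3": False})


def is_cut(edge):
    return (edge[0] in A) != (edge[1] in A)


print(len(V), len(E1), len(E2), len(A))
print(all(is_cut(e) for e in E1))
```

For a satisfying assignment, every edge of $E^1$ is cut and every clause cycle in $E^2$ contains at least one cut edge, which is the property the lemmas in this proof rely on.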

An example is shown in Figure 2.1. Here a 3SAT formula with three clauses over the variables
$x_1$, $x_2$, $x_3$ has been converted to an MMCP instance graph using the above procedure. To
simplify the figure, double-headed arrows indicate pairs of vertices which are mutually connected
by edges (2-cycles) in $E^1$. Solid and dashed lines distinguish the edges of the two sets $E^1$ and
$E^2$. The reader may find it instructive to identify the cycles in $E^2$ in the figure which correspond
to the clauses in the 3SAT instance.

Note that the above reduction of 3SAT to MMCP is polynomial time, since the size of
the MMCP input graph $G$ is polynomially bounded by the size of the 3SAT instance; $|V| = (2|X| +
2)|W|$ and $|E| = |E^1| + |E^2| = (2|X| + 2)|W|^2 + 4|W|$. Therefore it only remains to be shown that the
reduction is valid; that is, it must be shown that the MMCP instance gives a result of 1 iff the 3SAT
instance gives a result of 1.

Suppose the 3SAT instance has a satisfying assignment of variables; that is, there is an
assignment so that the output of the 3SAT problem is 1. Construct a partition $(A,B)$ of $V$ where $A$
contains those vertices whose corresponding literal has boolean value 1 under the given assignment,
and $B$ contains those vertices whose corresponding literal has boolean value 0. Note that $|A| = |B| = |V|/2$,
since exactly one of each literal and its negation has value 1.

Lemma 2.7. Every cycle in $G$ either contains an edge from $E^1$ or consists solely of edges from a
single cycle $E^2_i$ for some $i \in \{1, \ldots, |W|\}$.

Proof of Lemma 2.7. If a cycle in $G$ does not contain any edge from $E^1$, then it can only contain
edges from a single $E^2_i$, as the vertex sets of the cycles $E^2_i$ are disjoint. □


Lemma 2.8. Suppose $(A,B)$ is a partition derived from a satisfying 3SAT assignment as described
above. For every cycle $\ell \in C(G)$, the partition $(A,B)$ cuts some edge in $\ell$.

Proof of Lemma 2.8. All edges of $E^1$ are cut by $(A,B)$, since all such edges connect a vertex
associated with a literal to another vertex which is associated with the negation of that literal. Also, each
$E^2_i$ contains at least one cut edge; if this were not the case, then all vertices of $E^2_i$ would lie in $B$
(the side containing the vertex for the literal 0), a contradiction, as at least one literal from the 3SAT
clause $w_i$ must have boolean value 1. Using Lemma 2.7, every cycle must have at least one cut edge. □


Lemma 2.9. For every cycle $\ell \in C(G)$,

\[ \frac{\sum_{(u,v) \in \ell} d'(u,v)}{|\ell|} \le T \]


[Figure 2.1: MMCP instance graph constructed from the example 3SAT formula by the above procedure.]



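Proof of Lemma 2.9. From the choices of $d$ and $\delta$,

\[ d'(e) = \begin{cases} 0 & \text{if } e \text{ is cut by } (A,B) \\ 1 & \text{otherwise} \end{cases} \]

Then

\[ \frac{\sum_{(u,v) \in \ell} d'(u,v)}{|\ell|} = 1 - \frac{c(\ell)}{|\ell|} \]

where $c(\ell)$ is the number of edges of $\ell$ which are cut. The result then follows from Lemma 2.8. □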


By Lemma 2.9, $\mathrm{MMC}(G,d') \le T$, so the partition $(A,B)$ satisfies the MMCP instance,
and the MMCP instance generated from the 3SAT instance gives a result of 1.

Now for the converse: we show that if some partition $(A,B)$ of $V$ satisfies the constructed MMCP
instance, then a satisfying assignment exists for the 3SAT instance.

Lemma 2.10. Suppose $(A,B)$ is a partition which satisfies all the conditions of the MMCP problem.
For every cycle $\ell \in C(G)$, the partition $(A,B)$ cuts some edge in $\ell$.


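Proof of Lemma 2.10. Again

\[ d'(e) = \begin{cases} 0 & \text{if } e \text{ is cut by } (A,B) \\ 1 & \text{otherwise} \end{cases} \]

Now since $(A,B)$ satisfies the MMCP instance,

\[ \frac{\sum_{(u,v) \in \ell} d'(u,v)}{|\ell|} \le T < 1 \]

so some edge in $\ell$ must be cut by $(A,B)$. □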


Lemma 2.11. For all $y \in Y$, $w_a \in W$, $w_b \in W$, the vertex $u = (y, w_a)$ lies in $A$ iff the vertex
$v = (\bar{y}, w_b)$ lies in $B$.

Proof of Lemma 2.11. From Lemma 2.10 and the fact that the edges $(u,v) \in E^1$ and $(v,u) \in E^1$
form a cycle. □

Lemma 2.11 shows how to use the partition $(A,B)$ to derive a satisfying assignment for
the 3SAT instance. WLOG, assume the vertex $(1, w_1) \in A$. Then for every vertex in $A$ associated
with some positive literal, set the corresponding boolean variable to 1, and for every vertex in $A$
associated with some negative literal, set the corresponding boolean variable to 0. By Lemma 2.11,
this assignment is consistent: that is, if a variable is set to 1, all vertices associated with that variable
in its positive sense must lie in $A$, and all vertices associated with that variable in its negative sense
must lie in $B$. Likewise, for variables assigned to 0, all vertices associated with that variable in its
positive sense must lie in $B$, and all vertices associated with that variable in its negative sense must
lie in $A$.

Lemma 2.12. The given assignment of variables satisfies the 3SAT instance.

Proof of Lemma 2.12. Consider the cycle $E^2_i$. By Lemma 2.10, this cycle contains some cut edge.
But by construction $E^2_i$ contains the vertex $(0, w_i)$, which lies in $B$ (since $(1, w_1) \in A$, and Lemma
2.11 holds). Thus some vertex $v$ in $E^2_i$ must lie in $A$, so the given 3SAT assignment must give the
literal corresponding to $v$ the boolean value 1. Therefore the 3SAT clause $w_i$ is satisfied. □

From Lemma 2.9 and Lemma 2.12, the reduction from 3SAT to MMCP with $\delta < 0$ is
shown to be valid. 3SAT is known to be NP-complete [CLR90]. Therefore, under the restriction
$\delta < 0$, MMCP is NP-hard. □

2.3.2  MMCP  With  Positive  Cut  Set  Delay  Is  NP-Hard 

Theorem 2.13. For $\delta > 0$, MMCP is NP-hard.

Proof of Theorem 2.13. Again by reduction from 3SAT. The construction of the MMCP instance
differs in this case, however. Let $Y = X \cup \{\bar{x} : x \in X\}$ be the set of literals (both positive and
negative) for all variables in $X$. Now define sets

\[ W^1 = W \cup \{n_a, n_b, n_r, n_s\} \quad \text{and} \quad W^2 = (W \times \{1,2,3\}) \cup \{n_a, n_b\} \]

where $n_a, n_b, n_r, n_s$ are symbols distinct from the elements of $W \cup (W \times \{1,2,3\})$. These new
symbols are used to introduce vertices which are not associated with any clause from the 3SAT
instance. Also note that the Cartesian product $W \times \{1,2,3\}$ is used to create a vertex associated with
each of the individual literals in every 3SAT clause.

Construct two sets of vertices

\[ V^1 = Y \times W^1 \quad \text{and} \quad V^2 = \{0,1\} \times W^2 \]

where again 0 and 1 represent the boolean constants 0 and 1, respectively. Let $V = V^1 \cup V^2$.
Construct two sets of edges

\[ E^1 = \{(u,v) : u = (y, w_s) \in V^1,\ v = (y, w_t) \in V^1,\ w_s, w_t \in W^1,\ w_s \ne w_t\} \]

\[ E^2 = \{(u,v) : u = (y, w_s) \in V^2,\ v = (y, w_t) \in V^2,\ w_s, w_t \in W^2,\ w_s \ne w_t\} \]

$E^1$ joins every vertex $u \in V^1$ with every other vertex $v \in V^1$ which is associated with the same literal
as $u$, and likewise $E^2$ joins every vertex $u \in V^2$ with every other vertex $v \in V^2$ which is associated
with the same boolean constant as $u$.

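Now construct a third set of edges

\[ E^3 = E^r_1 \cup \ldots \cup E^r_{|X|} \cup E^s_1 \cup \ldots \cup E^s_{|X|} \]

where

\[ \begin{aligned}
E^r_j &= ((r_{j1}, r_{j2}), (r_{j2}, r_{j3}), (r_{j3}, r_{j4}), (r_{j4}, r_{j1})) & j &\in \{1, \ldots, |X|\} \\
r_{j1} &= (x_j, n_r) \in V^1 & j &\in \{1, \ldots, |X|-1\} \\
r_{j2} &= (x_{j+1}, n_a) \in V^1 & j &\in \{1, \ldots, |X|-1\} \\
r_{j3} &= (\bar{x}_j, n_r) \in V^1 & j &\in \{1, \ldots, |X|-1\} \\
r_{j4} &= (x_{j+1}, n_b) \in V^1 & j &\in \{1, \ldots, |X|-1\} \\
r_{|X|1} &= (x_{|X|}, n_r) \in V^1 \\
r_{|X|2} &= (1, n_a) \in V^2 \\
r_{|X|3} &= (\bar{x}_{|X|}, n_r) \in V^1 \\
r_{|X|4} &= (1, n_b) \in V^2
\end{aligned} \]

\[ \begin{aligned}
E^s_j &= ((s_{j1}, s_{j2}), (s_{j2}, s_{j3}), (s_{j3}, s_{j4}), (s_{j4}, s_{j1})) & j &\in \{1, \ldots, |X|\} \\
s_{j1} &= (x_j, n_s) \in V^1 & j &\in \{1, \ldots, |X|-1\} \\
s_{j2} &= (\bar{x}_{j+1}, n_a) \in V^1 & j &\in \{1, \ldots, |X|-1\} \\
s_{j3} &= (\bar{x}_j, n_s) \in V^1 & j &\in \{1, \ldots, |X|-1\} \\
s_{j4} &= (\bar{x}_{j+1}, n_b) \in V^1 & j &\in \{1, \ldots, |X|-1\} \\
s_{|X|1} &= (x_{|X|}, n_s) \in V^1 \\
s_{|X|2} &= (0, n_a) \in V^2 \\
s_{|X|3} &= (\bar{x}_{|X|}, n_s) \in V^1 \\
s_{|X|4} &= (0, n_b) \in V^2
\end{aligned} \]
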
$E^3$ is the union of $2|X|$ disjoint 4-cycles of two types, $E^r_j$ and $E^s_j$. Each cycle $E^r_j$ is either an
alternation between particular vertices from $V^1$ associated with the variable $x_j$ in both positive and negative
literal form and particular vertices from $V^1$ associated with the positive literal $x_{j+1}$ (for $j \in \{1, \ldots, |X|-1\}$),
or an alternation between particular vertices from $V^1$ associated with the variable
$x_j$ in both positive and negative literal form and particular vertices from $V^2$ associated with the boolean
constant 1 (for $j = |X|$). The cycles $E^s_j$ are similar to the cycles $E^r_j$, except that they contain vertices
associated with the negative literal $\bar{x}_{j+1}$ instead of the positive literal $x_{j+1}$ (for $j \in \{1, \ldots, |X|-1\}$),
or constant 0 instead of 1 (for $j = |X|$). The vertices chosen for these cycles are associated with the
elements $n_a, n_b, n_r, n_s$ so that each 4-cycle is on a set of vertices disjoint from any other 4-cycle in
$E^3$.

Construct a fourth set of edges

\[ E^4 = E^4_1 \cup \ldots \cup E^4_{|W|} \]

where, for all $i \in \{1, \ldots, |W|\}$,

\[ \begin{aligned}
E^4_i &= ((u_{i1}, v_{i1}), (v_{i1}, u_{i2}), (u_{i2}, v_{i2}), (v_{i2}, u_{i3}), (u_{i3}, v_{i3}), (v_{i3}, u_{i1})) \\
u_{ij} &= (y_{ij}, w_i) \in V^1 \qquad j \in \{1,2,3\} \\
v_{ij} &= (1, (w_i, j)) \in V^2 \qquad j \in \{1,2,3\}
\end{aligned} \]

$E^4$ is the union of disjoint 6-cycles $E^4_i$, each of which connects vertices associated with 3SAT clause $w_i$.
Each cycle alternates between vertices from $V^1$ associated with the literals which appear in $w_i$ and
vertices from $V^2$ which are associated with the boolean constant 1.

Finally, let $E = E^1 \cup E^2 \cup E^3 \cup E^4$. Take $k = |V|/2$, take $d(e) = 0$ for all $e \in E$, take $\delta = 1$, and take

r  =  1  —  ^ 

■ 

An example is shown in Figure 2.2. The figure shows the MMCP instance graph constructed
from the 3SAT formula (xi Vx2 VX3) A (xi VX3) A (xf V V2 VX3) using the above procedure.
To simplify the figure, the edges from $E^1 \cup E^2$ are not drawn; instead these edges are
represented by the shaded regions. For each such region, every pair of vertices contained therein is
connected in a 2-cycle with edges from $E^1 \cup E^2$. Solid lines indicate edges contained in $E^3$, while
dashed lines indicate edges contained in $E^4$. As with the previous example, the reader may find
it instructive to identify the cycles in $E^4$ in the figure which correspond to the clauses in the 3SAT
instance. Examination of the two cycles of $E^3$ which connect vertices associated with the variable
$x_1$ to those associated with $x_2$ may also be useful.
Figure 2.2: Example MMCP (positive cut set delay) instance from 3SAT reduction {x\ A (xi Vx2 VX3) A (xf Vx2 VX3). Shaded
regions indicate groups of vertices which are pairwise mutually connected with edges (2-cycles) from $E^1 \cup E^2$; solid lines indicate edges
from $E^3$; dashed lines indicate edges from $E^4$.




It is easy to see that this reduction from 3SAT to MMCP can be done in polynomial time;
the size of the generated graph is $|V| = 2|X|(|W| + 4) + 2(3|W| + 2)$ and

\[ |E| = |E^1| + |E^2| + |E^3| + |E^4| = 2|X|(|W| + 4)(|W| + 3) + 2(3|W| + 2)(3|W| + 1) + 4(2|X|) + 6|W|. \]

Now it only remains to be shown that this reduction is valid.
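
The size formulas can be sanity-checked by actually building the graph. The following is a minimal sketch, not the author's code: the function name, the encoding of literals as signed integers, and the string labels `"na"`, `"nb"`, `"nr"`, `"ns"` for the four extra symbols are all assumptions made for illustration, and each clause is assumed to contain three distinct literals.

```python
from itertools import permutations

ZERO, ONE = ("const", 0), ("const", 1)  # hypothetical labels for the boolean constants


def build_positive_delay_instance(num_vars, clauses):
    """Return (V, E) for a 3SAT formula; literal +j is x_j, -j is its negation."""
    literals = [s * j for j in range(1, num_vars + 1) for s in (1, -1)]
    w1 = [("w", i) for i in range(len(clauses))] + ["na", "nb", "nr", "ns"]
    w2 = [("w", i, p) for i in range(len(clauses)) for p in (1, 2, 3)] + ["na", "nb"]

    v1 = [(y, w) for y in literals for w in w1]
    v2 = [(c, w) for c in (ZERO, ONE) for w in w2]

    # E^1, E^2: 2-cycles between all vertices sharing a literal (or a constant).
    e1 = [((y, a), (y, b)) for y in literals for a, b in permutations(w1, 2)]
    e2 = [((c, a), (c, b)) for c in (ZERO, ONE) for a, b in permutations(w2, 2)]

    def ring(vs):
        """Edges of the directed cycle visiting the vertices vs in order."""
        return [(vs[i], vs[(i + 1) % len(vs)]) for i in range(len(vs))]

    # E^3: two 4-cycles per variable, linking x_j to x_{j+1} (or to the constants).
    e3 = []
    for j in range(1, num_vars + 1):
        pos_next = j + 1 if j < num_vars else ONE
        neg_next = -(j + 1) if j < num_vars else ZERO
        e3 += ring([(j, "nr"), (pos_next, "na"), (-j, "nr"), (pos_next, "nb")])
        e3 += ring([(j, "ns"), (neg_next, "na"), (-j, "ns"), (neg_next, "nb")])

    # E^4: one 6-cycle per clause, alternating literal and constant-1 vertices.
    e4 = []
    for i, clause in enumerate(clauses):
        cyc = []
        for p, y in enumerate(clause, start=1):
            cyc += [(y, ("w", i)), (ONE, ("w", i, p))]
        e4 += ring(cyc)

    return v1 + v2, e1 + e2 + e3 + e4


V, E = build_positive_delay_instance(3, [(1, 2, 3), (1, -2, 3), (-1, 2, 3)])
print(len(V), len(E))  # 64 514
```

For $|X| = 3$ and $|W| = 3$ the formulas give $|V| = 42 + 22 = 64$ and $|E| = 252 + 220 + 24 + 18 = 514$, matching the constructed graph. The clause list used here is an arbitrary illustration, not the formula of Figure 2.2.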

Suppose the 3SAT instance has some satisfying assignment. As before, construct a
partition $(A,B)$ of $V$ where $A$ contains those vertices whose corresponding literal has boolean value 1
under the given assignment, and $B$ contains those vertices whose corresponding literal has boolean
value 0. Note that $|A| = |B| = |V|/2 = k$.

Showing that this partition satisfies the MMCP instance is generally similar to the
corresponding portions of the proof of Theorem 2.6, although the details differ.

Lemma 2.14. Every cycle in $G$ either contains an edge from $E^1 \cup E^2$ or consists solely of edges
from a single cycle from $E^3 \cup E^4$.

Proof of Lemma 2.14. If a cycle in $G$ does not contain any edge from $E^1 \cup E^2$, then it can only
contain edges from a single cycle from $E^3 \cup E^4$, as $E^3 \cup E^4$ is the union of disjoint cycles. □

Lemma 2.15. Suppose $(A,B)$ is a partition derived from a satisfying 3SAT assignment as described
above. For every cycle $\ell \in C(G)$, the partition $(A,B)$ leaves some edge in $\ell$ uncut.

Proof of Lemma 2.15. No edge of $E^1 \cup E^2$ is cut by $(A,B)$, since all such edges connect a
vertex associated with a literal to another vertex associated with the same literal.

Now each cycle in $E^3$ must have some edge which is not cut. Suppose this were not the
case; that is, for some cycle $\ell$ in $E^3$, all edges of $\ell$ are cut. Since $\ell$ is a 4-cycle, there are two vertices
in $\ell$ which lie on the same side of the partition, where one vertex is associated with some literal
$y \in Y$ and the other is associated with $\bar{y}$, a contradiction since such vertices must have been assigned
to different sides of the partition.

Finally, suppose some cycle $E^4_i$ has no edge which is uncut. Since $E^4_i$ is a 6-cycle
with every second vertex associated with the boolean constant 1, all other vertices of $E^4_i$ not
associated with the boolean constant 1 must lie in $B$. But this means all literals in the 3SAT clause $w_i$
were assigned the value 0, a contradiction, since this clause would then not be satisfied, and the
assignment must satisfy the 3SAT formula.

From the above and Lemma 2.14, every cycle must have some edge which is not cut by
$(A,B)$. □
Lemma 2.16. For every cycle $\ell \in C(G)$,

\[ \frac{\sum_{(u,v) \in \ell} d'(u,v)}{|\ell|} \le T \]

Proof of Lemma 2.16. From the choices of $d$ and $\delta$,

\[ d'(e) = \begin{cases} 1 & \text{if } e \text{ is cut by } (A,B) \\ 0 & \text{otherwise} \end{cases} \]

Then

\[ \frac{\sum_{(u,v) \in \ell} d'(u,v)}{|\ell|} = 1 - \frac{c(\ell)}{|\ell|} \]

where $c(\ell)$ is the number of edges of $\ell$ which are not cut. The result then follows from Lemma 2.15. □


Therefore  the  partition  derived  from  the  satisfying  assignment  for  3SAT  is  a  satisfying 
partition  for  MMCP. 

For  the  converse,  suppose  a  partition  (A,B)  of  V  is  given  which  satisfies  the  MMCP 
instance.  The  following  shows  that  a  satisfying  3SAT  assignment  can  be  derived  from  (A,B). 

Lemma 2.17. Suppose $(A,B)$ is a partition which satisfies all the conditions of the MMCP problem.
For every cycle $\ell \in C(G)$, the partition $(A,B)$ leaves some edge in $\ell$ uncut.


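Proof of Lemma 2.17. Again

\[ d'(e) = \begin{cases} 1 & \text{if } e \text{ is cut by } (A,B) \\ 0 & \text{otherwise} \end{cases} \]

Now since $(A,B)$ satisfies the MMCP instance,

\[ \frac{\sum_{(u,v) \in \ell} d'(u,v)}{|\ell|} \le T < 1 \]

so some edge in $\ell$ must not be cut by $(A,B)$. □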


Lemma 2.18. For all $y \in Y$, $w_a \in W^1$, $w_b \in W^1$, the vertex $u = (y, w_a) \in V^1$ lies in $A$ iff the vertex
$v = (y, w_b) \in V^1$ lies in $A$. Also, for all $y \in \{0,1\}$, $w_a \in W^2$, $w_b \in W^2$, the vertex $u = (y, w_a) \in V^2$
lies in $A$ iff the vertex $v = (y, w_b) \in V^2$ lies in $A$.

Proof of Lemma 2.18. Directly from Lemma 2.17 and the fact that $E^1 \cup E^2$ contains the 2-cycle
$((u,v),(v,u))$. □

So far, this proof has been generally similar to that of Theorem 2.6. However, unlike the
previous proof, we have not yet demonstrated that a consistent assignment of SAT variables can be
obtained. The following lemma shows that the vertices associated with any given literal must lie on
the opposite side of the partition from the vertices associated with the negation of that literal, due to
the particular structure of the cycles in $E^3$.

Lemma 2.19. For all $y \in X$, $w_a, w_b \in W^1$, the vertex $u = (y, w_a) \in V^1$ lies in $A$ iff the vertex
$v = (\bar{y}, w_b) \in V^1$ lies in $B$. Also, for all $w_a \in W^2$, $w_b \in W^2$, the vertex $u = (0, w_a) \in V^2$ lies in $A$ iff
the vertex $v = (1, w_b) \in V^2$ lies in $B$.

Proof of Lemma 2.19. First consider variable $x_1 \in X$, and suppose $u = (x_1, w_a) \in V^1$ lies on the
same side of the partition as $v = (\bar{x}_1, w_b) \in V^1$ for some $w_a, w_b \in W^1$. WLOG, suppose $u, v \in A$. Then
by Lemma 2.18, the vertices

\[ u' = (x_1, n_r) \in V^1 \quad \text{and} \quad v' = (\bar{x}_1, n_r) \in V^1 \]

must both lie in $A$ as well.

Now consider the 4-cycle $E^r_1$. This cycle contains the vertices

\[ u^* = (x_2, n_a) \in V^1 \quad \text{and} \quad v^* = (x_2, n_b) \in V^1 \]

In particular,

\[ E^r_1 = ((u', u^*), (u^*, v'), (v', v^*), (v^*, u')) \]

Since both $u' \in A$ and $v' \in A$, then by Lemma 2.17, either $u^* \in A$ or $v^* \in A$. But then by
Lemma 2.18, $\{(x_2, w_a) : w_a \in W^1\} \subseteq A$. A similar argument using $E^s_1$ yields
$\{(\bar{x}_2, w_a) : w_a \in W^1\} \subseteq A$.

Now vertex $u'' = (x_2, n_r) \in A$ and $v'' = (\bar{x}_2, n_r) \in A$. Examining $E^r_2$ and $E^s_2$ with the same
argument as above yields the conclusion $\{(x_3, w_a) : w_a \in W^1\} \subseteq A$ and $\{(\bar{x}_3, w_a) : w_a \in W^1\} \subseteq A$.
Continuing this reasoning, eventually the conclusion is that $A = V$, which is a contradiction, since
$(A,B)$ is a satisfying MMCP partition and $|A| \le k = |V|/2$. Thus the original supposition that $u$ and $v$
lie on the same side of the partition must be false.

Now induction can be used. Given that the lemma is true over all vertices corresponding
to variables $x_j$, $j \in \{1, \ldots, m\}$, the lemma must also be true for vertices corresponding to $x_{m+1}$, since
assuming otherwise yields the contradiction $|A| > k$ using the same argument as above. Similarly,
the lemma must hold true for the vertices in $V^2$. □

Lemmas 2.18 and 2.19 show that a consistent variable assignment for the 3SAT problem
can be obtained from the given partition $(A,B)$. WLOG suppose vertex $(1, n_a) \in A$. Then choose
the 3SAT variable assignment such that the literal $y$ takes on the value 1 iff $(y, n_a) \in A$.

Lemma 2.20. The given assignment of variables satisfies the 3SAT instance.

Proof of Lemma 2.20. Suppose some 3SAT clause $w_i$ is unsatisfied. Then the cycle $E^4_i$ must
alternate between vertices in $A$ and vertices in $B$, since all literals $y_{ij}$, $j \in \{1,2,3\}$, for that clause take on
the value 0. But then every edge in the cycle $E^4_i$ is cut, which contradicts Lemma 2.17. Therefore
all 3SAT clauses must be satisfied. □

Lemma 2.16 and Lemma 2.20 show that the reduction from 3SAT to MMCP is valid.
Therefore, under the restriction that $\delta > 0$, MMCP is NP-hard. □

2.3.3  MMCP  Is  NP-Complete 

Theorem 2.21. MMCP is NP-hard.

Proof of Theorem 2.21. Directly from Theorem 2.6 and Theorem 2.13. □

Strictly speaking, only one of Theorem 2.6 or Theorem 2.13 is necessary to prove
Theorem 2.21. However, since the two cases have different physical interpretations, and the graphs used
for each case differ significantly, both are presented here for completeness.

Theorem 2.22. MMCP is in NP.

Proof of Theorem 2.22. For the decision problem of MMCP, if the output is 1 the partition $(A,B)$
which satisfies $\mathrm{MMC}(G,d') \le T$ serves as a certificate which can be checked in polynomial time,
since the labeling function $d'$ can be computed in polynomial time given $(A,B)$, and $\mathrm{MMC}(G,d')$
can be computed in polynomial time as well. □
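
The certificate check described in this proof can be sketched as follows. This is an illustrative sketch only: it assumes integer delays, uses Karp's recurrence as one known polynomial-time way to compute the maximum mean cycle (the text does not say which algorithm is intended), and assumes the decision conditions are $|A| \le k$, $|B| \le k$, and $\mathrm{MMC}(G,d') \le T$, with $d'(e) = d(e) + \delta$ on cut edges. The function names are hypothetical.

```python
from fractions import Fraction


def max_mean_cycle(n, edges):
    """Maximum mean cycle of a digraph on vertices 0..n-1 (Karp's recurrence).

    edges is a list of (u, v, weight) with integer weights.
    Returns None if the graph has no cycle.
    """
    # best[k][v]: largest total weight of a walk with exactly k edges ending at v
    best = [[None] * n for _ in range(n + 1)]
    best[0] = [0] * n
    for k in range(1, n + 1):
        for u, v, w in edges:
            if best[k - 1][u] is not None:
                cand = best[k - 1][u] + w
                if best[k][v] is None or cand > best[k][v]:
                    best[k][v] = cand
    answer = None
    for v in range(n):
        if best[n][v] is None:
            continue
        ratio = min(Fraction(best[n][v] - best[k][v], n - k)
                    for k in range(n) if best[k][v] is not None)
        if answer is None or ratio > answer:
            answer = ratio
    return answer


def check_certificate(n, edges, side, delta, k, T):
    """Verify a partition; side[v] is 'A' or 'B', edges are (u, v, d)."""
    size_a = sum(1 for s in side if s == "A")
    if size_a > k or n - size_a > k:
        return False  # balance condition violated
    # Cut edges receive the additional delay delta.
    relabeled = [(u, v, d + delta if side[u] != side[v] else d) for u, v, d in edges]
    mmc = max_mean_cycle(n, relabeled)
    return mmc is None or mmc <= T


two_cycle = [(0, 1, 1), (1, 0, 1)]
print(check_certificate(2, two_cycle, ["A", "B"], -1, 1, Fraction(1, 2)))  # True
```

Both steps run in time polynomial in $|V|$ and $|E|$ (the recurrence takes $O(|V||E|)$ arithmetic operations), which is all the membership argument needs.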

Theorem 2.23. MMCP is NP-complete.

Proof of Theorem 2.23. From Theorem 2.21 and Theorem 2.22. □
2.3.4 Generalizations


Recall  that  the  circuit  model  used  in  this  chapter  has  registers  represented  by  the  vertices 
of  G,  and  combinational  gates  abstracted  into  the  delays  represented  by  the  labeling  d.  That  is,  the 
MMCP  problem  considered  here  was  only  that  of  partitioning  the  registers  of  the  circuit.  However, 
it  is  possible  to  extend  the  above  result  for  the  case  where  the  combinational  elements  are  also  to 
be  included  in  the  partitioning  problem.  For  this,  the  maximum  profit-to-time  ratio  metric  can  be 
substituted for the MMC in the above. Suppose the edges of $G$ are labeled with the function
$x : E \to \mathbb{N}$ which represents the number of registers associated with that edge. Purely combinational circuit
elements would have a corresponding $x(e) = 0$, while individual registers would have $x(e) = 1$. The
maximum profit-to-time ratio is then

\[ \max_{\ell \in C(G)} \frac{\sum_{(u,v) \in \ell} d(u,v)}{\sum_{(u,v) \in \ell} x(u,v)} \]

This metric is a generalization of the MMC, and it is straightforward to extend the analysis in this
chapter to use this more general circuit model instead [DG98, DIG98, CTCG+98].
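
As a brief consistency check of the generalization claim (a sketch, assuming the ratio is the total delay around a cycle divided by the total register count around it, with $x$ the register labeling and $C(G)$ the set of cycles as above): when every edge carries exactly one register, the ratio reduces to the maximum mean cycle.

```latex
x(e) = 1 \ \ \forall e \in E
\quad\Longrightarrow\quad
\max_{\ell \in C(G)} \frac{\sum_{(u,v) \in \ell} d(u,v)}{\sum_{(u,v) \in \ell} x(u,v)}
= \max_{\ell \in C(G)} \frac{\sum_{(u,v) \in \ell} d(u,v)}{|\ell|}
= \mathrm{MMC}(G, d)
```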

It is also a straightforward extension to consider the case where the vertices have weights
$s(v)$, $v \in V$, associated with them (i.e. corresponding to the sizes of their respective cells), so that
the partition balance conditions become

\[ \sum_{v \in A} s(v) \le k \quad \text{and} \quad \sum_{v \in B} s(v) \le k \]

However, the model using unweighted vertices provides some interesting insight into where the
complexity of MMCP originates. The NP-completeness of the traditional integer partitioning
problem [GJ79] comes about because of the varying sizes of the elements to be partitioned. With equal
element sizes, integer partitioning can be solved trivially. This suggests that the complexity
of the MMCP problem arises fundamentally from the delay constraints, rather than the partition
balance conditions.

It should be noted that the delay model used in the above analysis is very simplistic, in that
cut edges are uniformly given a fixed additional delay $\delta$. Certainly more sophisticated delay models
such as the geometric embedding model proposed in [Lim00] would be more realistic. Although no
straightforward extension of the above analysis to use such delay models is known, it is reasonable
to believe that using a more complex delay model will not simplify the partitioning problem.
2.4 Summary


In this chapter, an optimization problem combining partitioning with sequential timing
constraints was shown to be NP-complete. This motivates the need to develop heuristics for
physical design under sequential timing constraints.

Chapter 3

Floorplanning  With  Retiming 
Constraints 

3.1  Introduction 

Retiming has the well-known property that, for any retiming, the number of registers in
any structural cycle in the circuit must remain constant. The converse of this is not generally true,
however. That is, given an original circuit and a target circuit which is both structurally identical and
preserves the number of registers in every cycle in the underlying graph, it is not always possible to
generate the target circuit from the original in a manner which preserves functionality. However, this
converse property does hold for a specific class of circuits, in particular, circuits whose underlying
graphs are strongly-connected.

In this chapter, a novel proof of this converse property is presented. A practical application
of this theorem is shown in a non-iterative approach to combining retiming and die-level
floorplanning for deep-submicron designs. Experimental results show notable improvement in clock cycle
times achievable using this technique.

3.1.1  Definitions 

For this chapter, we model designs using directed graphs with edge labels. The semantics
of the graphs in this chapter differ from those of Chapter 2. Here, the edge labels count the number
of registers or sequential elements present on the communication paths between the other circuit
elements, which are represented by the vertices.

Let $G = (V,E)$ be a finite strongly-connected directed multigraph.

Definition 3.1 (Path). A path $P$ of length $|P|$ in $G$ is a nonempty ordered list of edges

\[ P = ((u_1,v_1), (u_2,v_2), \ldots, (u_{|P|}, v_{|P|})), \qquad (u_i, v_i) \in E,\ i \in \{1, \ldots, |P|\} \]

such that $v_i = u_{i+1}$ for all $i \in \{1, \ldots, |P|-1\}$.

Definition 3.2 (Cycle). A cycle in $G$ is a path which begins and ends at the same vertex; that is,
$v_{|P|} = u_1$ using the notation above.

The strongly-connected property of $G$ is equivalent to asserting that every edge in $E$
participates in at least one cycle.

Definition  3.3  (Self-Intersecting  Cycle).  A  cycle  which  contains  some  edge  more  than  once  is  said 
to  be  self-intersecting. 

Note  that  here  we  use  the  term  cycle  to  generally  include  self-intersecting  cycles.  This  is 
in  contrast  to  Chapter  2,  where  the  term  referred  to  simple  (non-self-intersecting)  cycles  only.  We 
will  later  see  that  we  can  actually  ignore  self-intersecting  cycles  for  our  purposes. 

For any subset $V' \subseteq V$, define

\[ \alpha(V') = \{(u,v) \in E : u \notin V',\ v \in V'\} \]

\[ \beta(V') = \{(u,v) \in E : u \in V',\ v \notin V'\} \]

That is, $\alpha(V')$ is the set of edges which enter $V'$ and $\beta(V')$ is the set of edges which leave $V'$.

Let $S : E \to \mathbb{Z}$ and $S' : E \to \mathbb{Z}$ be labelings of the edges of $G$ with integers.

Definition 3.4 (Retiming Operation). For all $V' \subseteq V$, $x \in \mathbb{Z}$, we define the retiming operation
$\Phi(V',x)$ to be the function $\Phi(V',x) : S \mapsto S'$ such that

\[ S'(e) = \begin{cases} S(e) + x & \text{if } e \in \alpha(V') \\ S(e) - x & \text{if } e \in \beta(V') \\ S(e) & \text{otherwise} \end{cases} \]

Our notion of retiming here is a somewhat generalized restatement of the original
definition of retiming proposed in [LS91]. The original definition is equivalent to the above with the
additional restriction $|V'| = 1$.
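
The retiming operation can be sketched in a few lines. This is a minimal illustration, not code from the original work: the function name `retime` and the representation of a labeling as a dictionary keyed by `(u, v, key)` (with `key` distinguishing parallel edges of the multigraph) are assumptions.

```python
def retime(S, subset, x):
    """Apply the retiming operation for vertex set `subset` and integer x.

    S maps each multigraph edge (u, v, key) to its integer label.
    Returns the new labeling; S itself is left unchanged.
    """
    result = {}
    for (u, v, key), label in S.items():
        if u not in subset and v in subset:      # edge enters the subset
            result[(u, v, key)] = label + x
        elif u in subset and v not in subset:    # edge leaves the subset
            result[(u, v, key)] = label - x
        else:                                    # both endpoints inside, or both outside
            result[(u, v, key)] = label
    return result


# A 3-cycle a -> b -> c -> a with register counts 1, 1, 2.
S = {("a", "b", 0): 1, ("b", "c", 0): 1, ("c", "a", 0): 2}
S2 = retime(S, {"b"}, 1)
print(S2)  # {('a', 'b', 0): 2, ('b', 'c', 0): 0, ('c', 'a', 0): 2}
```

In the example, one register moves from the edge leaving `b` to the edge entering `b`, and the total around the cycle stays at 4.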
Definition 3.5 (Compatible Cycle). A cycle $\ell$ in $G$ is cycle compatible with respect to $S$ and $S'$ iff

\[ \sum_{e \in \ell} S(e) = \sum_{e \in \ell} S'(e) \]

Definition  3.6  (Compatible  Labelings).  S  and  S'  are  said  to  be  compatible  labelings  iff  all  cycles 
in  G  are  compatible  with  respect  to  S  and  S'. 


For simplicity of notation, we make use of a particular labeling $T : E \to \mathbb{Z}$ of the edges of
$G$ with integers. We call $T$ the target labeling for reasons which will become clear later.

Definition 3.7 (Edge Satisfaction). An edge $e \in E$ is edge satisfied by $S$ iff $S(e) = T(e)$.

Definition 3.8 (Path Satisfaction). A path $P$ is path satisfied by $S$ iff

\[ \sum_{e \in P} S(e) = \sum_{e \in P} T(e) \]

Note that the concatenation of two satisfied paths is also a satisfied path.

Definition 3.9 (Cycle Satisfaction). A cycle $\ell$ is cycle satisfied by $S$ iff

\[ \sum_{e \in \ell} S(e) = \sum_{e \in \ell} T(e) \]

Note  that  cycle  satisfaction  is  simply  the  same  as  cycle  compatibility  with  S'  =  T . 

In  the  following,  we  will  simply  use  the  terms  compatible  and  satisfied  whenever  the 
referred  type  is  already  clear  from  the  context. 


3.2  Theoretical  Results 

Claim 3.10. If $V' = \{v_1, \ldots, v_{|V'|}\}$, then

\[ \Phi(V',x) = \Phi(\{v_1\},x) \circ \ldots \circ \Phi(\{v_{|V'|}\},x) \]

That is, the retiming operation $\Phi(V',x)$ can be formed by applying retiming on the individual
vertices in $V'$.

Proof of Claim 3.10. Consider any edge $e = (u,v) \in E$. If $u \notin V'$, $v \in V'$, then $e \in \alpha(V')$ and

\[ \begin{aligned} \{\Phi(\{v_1\},x) \circ \ldots \circ \Phi(\{v_{|V'|}\},x)\}(S)(e) &= \Phi(\{v\},x)(S)(e) \\ &= S(e) + x \\ &= \Phi(V',x)(S)(e) \end{aligned} \]

If $u \in V'$, $v \notin V'$, then $e \in \beta(V')$ and

\[ \begin{aligned} \{\Phi(\{v_1\},x) \circ \ldots \circ \Phi(\{v_{|V'|}\},x)\}(S)(e) &= \Phi(\{u\},x)(S)(e) \\ &= S(e) - x \\ &= \Phi(V',x)(S)(e) \end{aligned} \]

If $u \in V'$, $v \in V'$, then

\[ \begin{aligned} \{\Phi(\{v_1\},x) \circ \ldots \circ \Phi(\{v_{|V'|}\},x)\}(S)(e) &= \{\Phi(\{u\},x) \circ \Phi(\{v\},x)\}(S)(e) \\ &= S(e) - x + x \\ &= S(e) \\ &= \Phi(V',x)(S)(e) \end{aligned} \]

Finally, if $u \notin V'$, $v \notin V'$, then

\[ \begin{aligned} \{\Phi(\{v_1\},x) \circ \ldots \circ \Phi(\{v_{|V'|}\},x)\}(S)(e) &= S(e) \\ &= \Phi(V',x)(S)(e) \end{aligned} \]

□

Claim 3.10 shows the equivalence of our notion of retiming with that of [LS91], in that the
retiming operation $\Phi$ can be decomposed into the retiming transformations proposed in that work.
Let $S : E \to \mathbb{Z}$ and $S' : E \to \mathbb{Z}$ be labelings of the edges of $G$ with integers.

Claim 3.11. If $S$ is compatible with $S'$, then $\Phi(V',x)(S)$ is a labeling compatible with $S'$ as well.

Proof of Claim 3.11. Consider any cycle $\ell$ in $G$. $|\ell \cap \alpha(V')| = |\ell \cap \beta(V')|$, so

\[ \sum_{e \in \ell} S(e) = \sum_{e \in \ell} \Phi(V',x)(S)(e) \]

Thus $S$ is compatible with $\Phi(V',x)(S)$. Compatibility of labelings is an equivalence relation, so
$\Phi(V',x)(S)$ must be compatible with $S'$. □

Claim 3.11 shows that the retiming operation preserves compatibility of all cycles. This
result also follows from the decomposition of retiming operations into single-vertex retiming
operations (Claim 3.10) and [LS91].
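
Both claims are easy to check numerically on a small graph. The sketch below is illustrative only: the helper `retime`, the edge-keyed dictionary representation, and the example graph are assumptions, not taken from the original work.

```python
from functools import reduce


def retime(S, subset, x):
    """Retiming operation on a labeling S: {(u, v, key): int}."""
    def shift(u, v):
        if u not in subset and v in subset:
            return x       # edge enters the subset
        if u in subset and v not in subset:
            return -x      # edge leaves the subset
        return 0
    return {(u, v, key): label + shift(u, v) for (u, v, key), label in S.items()}


S = {("a", "b", 0): 1, ("b", "c", 0): 1, ("c", "a", 0): 2, ("b", "a", 0): 0}
subset = {"a", "b"}

# Claim 3.10: retiming the whole subset equals retiming its vertices one by one.
joint = retime(S, subset, 1)
stepwise = reduce(lambda lab, v: retime(lab, {v}, 1), sorted(subset), S)
print(joint == stepwise)  # True

# Claim 3.11: the label sum around every cycle is unchanged.
cycles = [
    [("a", "b", 0), ("b", "c", 0), ("c", "a", 0)],
    [("a", "b", 0), ("b", "a", 0)],
]
print(all(sum(S[e] for e in c) == sum(joint[e] for e in c) for c in cycles))  # True
```

The invariance holds because a cycle enters the chosen vertex set exactly as many times as it leaves it, so the `+x` and `-x` contributions cancel.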


Theorem 3.12. Labelings $S$ and $S'$ are compatible iff all non-self-intersecting cycles $\ell$ in $G$ are
compatible with respect to $S$ and $S'$.

Proof of Theorem 3.12. One direction ("only if") is trivial. For the other direction ("if"), use
induction on the length of a given cycle. For the base case, let $\ell$ be any cycle of length 2. $\ell$ cannot
be self-intersecting, so $\ell$ must be compatible with respect to $S$ and $S'$. Now consider a cycle $\ell$ of
length $|\ell| > 2$, and suppose all cycles of length less than $|\ell|$ are compatible with respect to $S$ and $S'$
(induction hypothesis). If $\ell$ is non-self-intersecting, $\ell$ must be compatible with respect to $S$ and $S'$.
If $\ell$ is self-intersecting, there is an edge $e_i$ which is visited at least twice, so we can write

\[ \ell = (e_1, \ldots, e_i, \ldots, e_{j-1}, e_j = e_i, e_{j+1}, \ldots, e_{|\ell|}) \]

for some $j \ne i$. We can then decompose $\ell$ into two cycles

\[ \ell_1 = (e_1, \ldots, e_i, e_{j+1}, \ldots, e_{|\ell|}) \]

\[ \ell_2 = (e_{i+1}, \ldots, e_j = e_i) \]

so that

\[ \sum_{e \in \ell_1} S(e) + \sum_{e \in \ell_2} S(e) = \sum_{e \in \ell} S(e) \]

\[ \sum_{e \in \ell_1} S'(e) + \sum_{e \in \ell_2} S'(e) = \sum_{e \in \ell} S'(e) \]

But $\ell_1$ and $\ell_2$ are cycles of length less than $|\ell|$, so by the induction hypothesis

\[ \sum_{e \in \ell_1} S(e) = \sum_{e \in \ell_1} S'(e) \]

\[ \sum_{e \in \ell_2} S(e) = \sum_{e \in \ell_2} S'(e) \]

and so $\ell$ must be compatible with respect to $S$ and $S'$. □

Theorem  3.12  indicates  that  we  can  ignore  self-intersecting  cycles  in  our  analysis,  as 
compatibility  of  all  simple  cycles  implies  compatibility  of  all  cycles,  including  self-intersecting 
cycles. 
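
The decomposition step used in the proof can be sketched as follows. This is an illustration under assumptions: a cycle is represented as a list of edge identifiers in traversal order, and the function name `split_at_repeat` is hypothetical.

```python
def split_at_repeat(cycle):
    """Split a self-intersecting cycle at its first repeated edge.

    cycle is a list of edge identifiers in traversal order. Returns the two
    shorter cycles (l1, l2), or None if no edge occurs twice.
    """
    first_seen = {}
    for j, e in enumerate(cycle):
        if e in first_seen:
            i = first_seen[e]
            l1 = cycle[: i + 1] + cycle[j + 1:]   # skip the detour between the two visits
            l2 = cycle[i + 1: j + 1]              # the detour itself, closed by the repeated edge
            return l1, l2
        first_seen[e] = j
    return None


# Edges: e1 = a->b, e2 = b->c, e3 = c->b, e4 = c->a.
# The cycle a->b->c->b->c->a uses e2 twice.
print(split_at_repeat(["e1", "e2", "e3", "e2", "e4"]))
# (['e1', 'e2', 'e4'], ['e3', 'e2'])
```

Every edge occurrence of the original cycle lands in exactly one of the two pieces, which is why the label sums of the pieces add up to the label sum of the original cycle.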

Theorem 3.13. Let $S$ and $T$ be compatible integer labelings of the edges of $G$. Then there exists a
finite sequence of successive retiming operations which satisfies all edges (i.e. transforms $S$ to $T$).

Proof of Theorem 3.13. For the following discussion, we introduce two auxiliary definitions.

Definition 3.14 ($\sigma$-Path). A $\sigma$-path of length $n$ from $v_1$ to $v_n$ with respect to $S$ is an ordered list of
vertices $(v_1, \ldots, v_n)$, $v_i \in V$, such that, for all $1 \le i \le n-1$, there exists some edge $e_i \in E$ which is
satisfied by $S$, where either $e_i = (v_i, v_{i+1})$ or $e_i = (v_{i+1}, v_i)$.




Note that if $P$ is a $\sigma$-path from $v_1$ to $v_n$, this does not imply that $P$ is a satisfied path
with respect to $S$. Moreover, the existence of $P$ does not even imply there exists a directed path from
$v_1$ to $v_n$ in the original graph $G$. However, $P$ does correspond to a path in the underlying undirected
graph between $v_1$ and $v_n$, where all the corresponding directed edges are satisfied by $S$.

Definition 3.15 ($\sigma$-Closure). For any $v \in V$, define the $\sigma$-closure of $v$ with respect to $S$ to be the set
of vertices

\[ K(S,v) = \{v\} \cup \{v' \in V : \text{there exists a } \sigma\text{-path from } v \text{ to } v' \text{ with respect to } S\} \]

Given  these  definitions  we  now  proeeed  with  the  proof. 

If $S$ satisfies all edges, then $S = T$ and we are trivially done. Otherwise there exists some
edge $e^* = (u^*, v^*) \in E$ for which $S(e^*) > T(e^*)$ (any unsatisfied edge lies on a cycle, and since $S$ and $T$
have equal sums around that cycle, some edge on it must have $S > T$). Let $V^* = K(S, v^*)$.

Lemma 3.16. $u^* \notin V^*$.

Proof of Lemma 3.16. Suppose $u^* \in V^*$. Then by definition of $\sigma$-closure there exists a $\sigma$-path from
$v^*$ to $u^*$

$$P = (v_1 = v^*, \ldots, v_n = u^*)$$

in $G$ with respect to $S$. Now consider the construction of an ordinary directed path $P'$ from $v^*$ to $u^*$
by examining every pair of vertices $(v_i, v_{i+1})$, $1 \le i \le n-1$, in the order they appear along $P$. Since
$P$ is a $\sigma$-path, there must exist some edge $e_i \in E$, $1 \le i \le n-1$, where $e_i$ is satisfied by $S$ and for
which one of the following two cases must hold:

• Case 1: $e_i = (v_i, v_{i+1})$. In this case we append the edge $e_i$ to $P'$.

• Case 2: $e_i = (v_{i+1}, v_i)$. Since every edge participates in at least one cycle, there is a cycle
$\ell_i$ containing $e_i$. Since $S$ and $T$ are compatible, $\ell_i$ is cycle-satisfied by $S$. Since $e_i$ is edge-satisfied
as well, the path $\ell_i \setminus (e_i)$ (that is, the path obtained by removing the single edge
$e_i$ from $\ell_i$) must also be satisfied. Instead of appending $e_i$ to $P'$, we append the path $\ell_i \setminus (e_i)$.

Since $P'$ is constructed by adjoining satisfied edges and satisfied paths, $P'$ must be path-satisfied.
Now appending $e^*$ to $P'$ gives a cycle $\ell'$, and $\ell'$ must be satisfied since $S$ and $T$ are
compatible. But then $\ell' \setminus P' = (e^*)$ must be satisfied as well, since both $\ell'$ and $P'$ are satisfied. This
is a contradiction, as $e^*$ was chosen to be an unsatisfied edge. Thus $u^* \notin V^*$. □
From Lemma 3.16, $u^* \notin V^*$. But $v^* \in V^*$, so $e^* \in \alpha(V^*)$. Now take $x = T(e^*) - S(e^*)$.

Note that

$$\Phi(V^*, x)(S)(e^*) = S(e^*) + x = T(e^*)$$

so applying $\Phi(V^*, x)$ to $S$ yields a labeling where $e^*$ is satisfied.

Now we consider the other edges of $E$.

Lemma 3.17. If edge $e' \in E$ is satisfied by $S$, then $e'$ is also satisfied by $\Phi(V^*, x)(S)$.

Proof of Lemma 3.17. Suppose $e' = (u', v') \in \alpha(V^*)$. By definition of $\alpha$, $u' \notin V^*$ and $v' \in V^*$. But
since $v' \in V^*$ and $e'$ is satisfied by $S$, $u' \in V^*$, by definition of $\sigma$-closure. This is a contradiction, so
$e' \notin \alpha(V^*)$. A similar argument shows $e' \notin \beta(V^*)$. Thus $\Phi(V^*, x)(S)(e') = S(e') = T(e')$. □

By Lemma 3.17, the number of edges satisfied by $\Phi(V^*, x)(S)$ must be strictly greater
than the number of edges satisfied by $S$. Also, Claim 3.11 shows that $\Phi(V^*, x)(S)$ is compatible
with $T$, since $S$ is compatible with $T$. Therefore, the sequence

$$S_0 = S$$

$$S_1 = \Phi(V_0^*, x_0)(S_0)$$

$$S_2 = \Phi(V_1^*, x_1)(S_1)$$

$$\vdots$$

converges to $T$ in a finite number of steps, where $V_i^*$ and $x_i$ are the values of $V^*$ and $x$ computed
according to the analysis above under the labeling $S_i$. That is, we can successively apply the retiming
operation $\Phi(V^*, x)$ to $S$ in order to obtain $T$. □
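The constructive argument above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from this work: the function names are hypothetical, the graph is a plain list of `(u, v)` pairs with labelings stored as parallel lists, and the retiming operation is assumed to add $x$ to every edge entering $V^*$ and subtract $x$ from every edge leaving $V^*$, which is consistent with how the operation is used in the proof.

```python
def sigma_closure(edges, S, T, start):
    """Vertices joined to `start` by already-satisfied edges (S == T), ignoring direction."""
    closure, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for k, (a, b) in enumerate(edges):
            if S[k] != T[k]:
                continue  # only satisfied edges may be traversed
            for here, there in ((a, b), (b, a)):
                if here == v and there not in closure:
                    closure.add(there)
                    stack.append(there)
    return closure


def retiming_sequence(edges, S, T):
    """Return (ops, final_labeling); ops is a list of (V_star, x) retiming operations."""
    S = list(S)
    ops = []
    # Each operation satisfies at least one more edge, so |E| operations suffice.
    for _ in range(len(edges) + 1):
        unsatisfied = [k for k in range(len(edges)) if S[k] > T[k]]
        if not unsatisfied:
            break
        k = unsatisfied[0]
        u_star, v_star = edges[k]
        V_star = sigma_closure(edges, S, T, v_star)
        if u_star in V_star:
            # Cannot happen for compatible labelings (Lemma 3.16).
            raise ValueError("labelings are not compatible")
        x = T[k] - S[k]  # negative, i.e. a forward retiming move
        for j, (a, b) in enumerate(edges):
            if a not in V_star and b in V_star:
                S[j] += x  # edge enters V_star
            elif a in V_star and b not in V_star:
                S[j] -= x  # edge leaves V_star
        ops.append((frozenset(V_star), x))
    if S != list(T):
        raise ValueError("target not reached; labelings are not compatible")
    return ops, S


# Three-vertex ring a -> b -> c -> a with two registers in total.
ring = [("a", "b"), ("b", "c"), ("c", "a")]
ops, final = retiming_sequence(ring, [2, 0, 0], [0, 1, 1])
```

On this ring the sketch first pushes both registers forward across `b`, then pushes one of them across `c`. Intermediate labelings are not checked for non-negativity here, since the theorem is stated for integer labelings.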

Theorem  3.13  indicates  that,  if  a  target  retiming  preserves  the  number  of  registers  for 
every  cycle  in  G,  then  there  exists  a  finite  sequence  of  retiming  operations  which  can  generate  this 
target.  Moreover,  as  the  proof  of  this  theorem  is  constructive  in  nature,  we  have  a  procedure  to 
obtain  the  desired  sequence  of  retiming  operations.  Thus,  the  constraint  that  the  number  of  registers 
remain  constant  for  all  cycles  in  G  is  both  necessary  and  sufficient  for  any  valid  retiming  of  G. 

3.2.1  Existing  Work 

This  theory  was  initially  developed  in  the  belief  that  it  was  a  completely  novel  result 
in  the  area  of  retiming.  However,  it  was  subsequently  found  that  the  same  result  was  derived  in 
[SSBSV92].  Our  proof  of  the  necessary  and  sufficient  conditions  for  valid  retimings  differs  from 
that given in [SSBSV92], most notably in two ways. First, [SSBSV92] only shows that the temporality
(i.e. sequential behavior) of a circuit is maintained given the number of registers in any
cycle is fixed. However, in that work it was not shown that a sequence of valid retiming operations
exists which transforms a circuit into a target temporally-equivalent circuit. The proof presented
here shows such a sequence does indeed exist. Second, our proof is constructive, so this sequence
of retiming operations is explicitly determined. Our proof thus provides a more insightful view of
this characterization of valid retimings.

3.2.2  Practical  Considerations 

One might argue that the condition that $G$ be strongly-connected may be too restrictive
in practice. However, note that in a design with no redundancies, every edge must reach some
primary output by some path in $G$; edges which do not connect to a primary output can be removed
without affecting the circuit behavior. Additionally, $G$ can be easily modified so that all edges can
be reached from some primary input. Thus, with suitable modification to $G$, edges which do not
participate in a cycle must lie on some path from a primary input to a primary output. Hence, adding
registers on such edges only contributes to the overall latency of the circuit, and does not change
the sequential behavior of the circuit in other ways. By ignoring these edges during retiming (i.e.
if we perform retiming on each strongly-connected component of $G$ independently), we thus allow
the overall latency to grow unbounded, but do not affect the functionality of the circuit beyond its
latency.

An alternate approach to dealing with the condition that $G$ be strongly-connected is to
introduce an additional vertex $v_{ext}$ as a "host node" in the graph, representing the environment
external to the circuit. Additional edges from $v_{ext}$ to the primary inputs of the circuit, and from the
primary outputs of the circuit to $v_{ext}$ are also added. Now in this new graph, all edges participate in
some cycle with $v_{ext}$. This host node approach is commonly used in the literature, e.g. [SSBSV92,
CL00].
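A minimal sketch of the host node construction, assuming the design is given as a list of directed `(u, v)` edges together with lists of primary input and primary output vertices. The helper names and the `"v_ext"` label are hypothetical, chosen only for illustration.

```python
def add_host_node(edges, primary_inputs, primary_outputs, host="v_ext"):
    """Return a new edge list with host -> each primary input and each primary output -> host."""
    extra = [(host, pi) for pi in primary_inputs]
    extra += [(po, host) for po in primary_outputs]
    return list(edges) + extra


def reachable(edges, start, reverse=False):
    """Set of vertices reachable from `start`, optionally following edges backwards."""
    adj = {}
    for u, v in edges:
        if reverse:
            u, v = v, u
        adj.setdefault(u, []).append(v)
    seen, stack = {start}, [start]
    while stack:
        for w in adj.get(stack.pop(), []):
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen


def is_strongly_connected(edges):
    """True if every vertex reaches, and is reached from, an arbitrary root."""
    vertices = {x for e in edges for x in e}
    if not vertices:
        return True
    root = next(iter(vertices))
    return reachable(edges, root) == vertices == reachable(edges, root, reverse=True)


# A feed-forward pipeline: no edge lies on a cycle until the host node is added.
pipeline = [("pi", "a"), ("a", "b"), ("b", "po")]
with_host = add_host_node(pipeline, ["pi"], ["po"])
```

Here `is_strongly_connected(pipeline)` is false, while `is_strongly_connected(with_host)` is true, so every edge of the augmented graph participates in some cycle through the host node.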

Recall in the proof of Theorem 3.13 that the construction of the first retiming operation
in the sequence which transforms the labeling $S$ to the labeling $T$ begins with the selection of some
edge $e^*$ such that $S(e^*) > T(e^*)$. This selection criterion ensures that the corresponding
retiming operation $\Phi(V^*, x)$ which satisfies $e^*$ has $x < 0$. This corresponds to a forward retiming
operation, where registers are removed from the inputs of $V^*$ and added to the outputs of $V^*$. Forward
retiming operations are generally preferred in practice, as these can be easily realized. On the other
hand, backward retiming operations, where registers are removed from the outputs of $V^*$ and added
to the inputs of $V^*$, may not be implementable without modifying reset behavior [TB93, ESS96].
This criterion on the selection of $e^*$ thus avoids the problems associated with backward retiming
operations.

3.3  Floorplanning  Application 

Here we describe an application of our retiming theory to the generation of die-level
floorplans under deep-submicron physical conditions. Under current technology trends, wire delays
are predicted to grow such that multiple clock cycles will be required for die-level interconnects
[Ass97]. Thus, the task of generating a feasible floorplan is tightly coupled with retiming, as the
flexibility of placement of modules is directly related to the assignment of registers to the interconnects.
That is, interconnects with more registers may be made longer than interconnects with few
registers.

3.3.1  Problem  Formulation 

Designs are modeled as a directed multigraph $G$, where the nodes of the graph represent
large macroblocks, possibly millions of gates each. Here, the blocks represent sequential logic,
though we ignore the registers internal to the blocks, as we are only concerned with retiming the
registers on the block-to-block interconnects.

Each macroblock $n$ has an associated layout area $A_n$, though the aspect ratio (width/height)
of each block may be flexible. As is typical, we assume rectangular blocks only, for simplicity. The
edges represent the interconnect between blocks; each edge may be a bus consisting of many wires,
though for our purposes we abstract this detail away. We are also given the edge labeling $S(e)$,
representing an initial assignment of registers to the interconnects; this represents an initial
design/retiming which satisfies the functionality required by the system.

Our problem is thus: find a location $(x_n, y_n)$ and width and height $(w_n, h_n)$ for each
macroblock $n$, and retiming $T(e)$ compatible with $S(e)$ such that

• $w_n h_n = A_n$

• No macroblocks overlap

• For edge $e = (m, n)$, $T(e) \ge \lfloor d(m,n)/\phi \rfloor$, where $d(m,n)$ is the interconnect delay from block
$m$ to block $n$, and $\phi$ is the clock period

We estimate the interconnect delay $d(m,n)$ using a number of assumptions. First, we assume
interconnect delay is linearly related to distance, given optimally buffered lines as in [OB98].
Optimal buffering is a reasonable assumption at this level, given the long chip-level interconnect we
are dealing with here. Second, we use the Manhattan distance $D(m,n)$ between the centers of the
macroblocks as an estimate for the length of the interconnect. Third, we assume the macroblocks
are all Moore machines, with registered outputs. This assumption is made to ensure that each
interconnect can be treated independently; in the presence of combinational paths which span multiple
chip-level interconnects, the inequality constraint for $T(e)$ given above does not necessarily hold.
Given the large size of the macroblocks, it is reasonable to assume they will be designed in some
regular fashion, so the Moore machine assumption seems to be justified. For simplicity, we assume
our registers have no intrinsic propagation delays associated with them, although we can easily
model such effects by subtracting such a delay from the clock period $\phi$. Finally, we assume there
is a combinational delay $t(m,n)$ associated with the inputs to block $n$ coming from block $m$; this
models the fact that, for a general Moore machine, there may be combinational logic between an
input and a register. Based on this we have:

$$d(m,n) = K D(m,n) + t(m,n)$$

for some constant $K$ relating distance to time; $K$ depends on the physical technology used for the
chip itself.
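The delay model can be illustrated with a short Python sketch. The function names are hypothetical. The numeric values borrow the technology assumptions of Section 3.4.2 (16 µm/ps propagation, 100 ps input-to-register delay), and the register bound is read as a floor of $d(m,n)/\phi$, which is an assumption about the constraint as stated in Section 3.3.1.

```python
import math


def manhattan(center_a, center_b):
    """Manhattan distance between two block centers given as (x, y) in micrometers."""
    return abs(center_a[0] - center_b[0]) + abs(center_a[1] - center_b[1])


def interconnect_delay(center_m, center_n, K, t_mn):
    """d(m,n) = K * D(m,n) + t(m,n), in picoseconds."""
    return K * manhattan(center_m, center_n) + t_mn


def min_registers(delay_ps, phi_ps):
    """Lower bound on T(e), assuming the bound is floor(d / phi)."""
    return math.floor(delay_ps / phi_ps)


K = 1 / 16      # ps per micrometer, i.e. 16 um/ps (Section 3.4.2)
t_mn = 100.0    # ps, input-to-register delay (Section 3.4.2)
delay = interconnect_delay((0, 0), (8000, 4000), K, t_mn)
needed = min_registers(delay, phi_ps=500.0)
```

For two blocks whose centers are 12 mm apart in Manhattan distance, this gives a delay of 850 ps, so at a 500 ps clock period the interconnect needs at least one register under the floor reading.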

3.3.2  Procedure 

Given that retiming constraints (the number of registers around each cycle in $G$) are sufficient
to describe all valid retimings of $G$, one might hope to translate these into placement constraints
which are amenable for a floorplanning or placement tool to use. If the placement obeys
such placement constraints, the end result will be guaranteed to have a valid retiming. Here we propose
a heuristic method to try to satisfy the retiming constraints; as it is a heuristic, the final result
may not satisfy all the constraints properly. However, note that we may always reduce the clock
frequency in order to satisfy the retiming constraints; increasing the clock period allows violated
constraints to become met. Thus, for our methodology, the maximum clock frequency (or minimum
clock period) at which the retiming constraints are satisfied becomes a metric for the quality of the
floorplan.




Our floorplanning procedure generates a slicing structure derived using recursive mincut
partitioning [Bre77, Ott82, WL86]. The implementation for our algorithm uses HMetis [KK99]
as the core mincut partitioner. We build the slicing tree recursively, using mincut partitioning to
determine the structure of the tree at each level. Using mincut partitioning in this manner acts
effectively as a heuristic for minimization of wirelength; as a minimum number of edges are cut at
each stage, we expect that relatively few wires will have long lengths. This technique effectively
implements the same wirelength minimization goals commonly used by other floorplanners.

To heuristically apply our retiming constraints to floorplanning, we adjust the weights on
each of the edges in $G$ in order to promote clustering of cycles within a single partition, rather than
spreading cycles across partitions. Moreover, intuitively some cycles are more "critical" than others,
in the sense that cutting such cycles during partitioning may lead to problems when generating the
floorplan. We thus wish to avoid cutting edges which:

•  Participate  in  many  cycles 

•  Participate  in  cycles  which  have  few  registers 

•  Participate  in  cycles  which  have  few  modules 

Empirical observation shows that the following formula gives good results in practice when used
for the weighting of edge $e$:

$$w(e) = \sum_{\ell \in L(e)} M(\ell)/N(\ell)$$

where $L(e)$ is the set of all cycles in which edge $e$ participates, $M(\ell)$ is the number of modules in
cycle $\ell$, and $N(\ell)$ is the number of registers available in cycle $\ell$.
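The weighting can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in the experiments: vertices are numbered `0..n-1`, the simple cycles are enumerated by a plain depth-first search (adequate only for small macroblock graphs), $M(\ell)$ is taken to be the number of edges on the cycle, and $N(\ell)$ is taken to be the sum of the initial register labels along the cycle. The function names are hypothetical.

```python
def simple_cycles(num_vertices, edges):
    """Yield every simple directed cycle as a list of edge indices.

    Each cycle is reported once, rooted at its smallest vertex.
    """
    out = [[] for _ in range(num_vertices)]
    for k, (u, v) in enumerate(edges):
        out[u].append((k, v))

    for root in range(num_vertices):
        path, on_path = [], set()

        def dfs(u):
            on_path.add(u)
            for k, v in out[u]:
                if v == root:
                    yield path + [k]          # closed a cycle back to the root
                elif v > root and v not in on_path:
                    path.append(k)
                    yield from dfs(v)
                    path.pop()
            on_path.discard(u)

        yield from dfs(root)


def edge_weights(num_vertices, edges, registers):
    """w(e) = sum over cycles containing e of M(cycle) / N(cycle)."""
    w = [0.0] * len(edges)
    for cycle in simple_cycles(num_vertices, edges):
        modules = len(cycle)                      # M: one module per edge of a simple cycle
        regs = sum(registers[k] for k in cycle)   # N: registers available on the cycle
        if regs == 0:
            raise ValueError("cycle without registers")
        for k in cycle:
            w[k] += modules / regs
    return w


# Two cycles sharing edge 0: (0 -> 1 -> 0) and (0 -> 1 -> 2 -> 0).
edges = [(0, 1), (1, 0), (1, 2), (2, 0)]
weights = edge_weights(3, edges, registers=[1, 1, 1, 2])
```

Edge 0 lies on both cycles, so it receives the largest weight (2/2 + 3/4 = 1.75) and is the edge the partitioner is most discouraged from cutting.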

In  our  experiments,  we  generate  two  different  floorplans.  First  we  use  the  wirelength- 
driven  floorplanning  technique  (i.e.  using  unweighted  edges)  to  obtain  a  floorplan.  Then  we  use 
the edge-weighted method to obtain a second floorplan. Comparing the two results thus gives an
indication  of  the  improvement  possible  when  using  our  retiming  constraint-driven  technique,  as  a 
measure  of  the  merit  of  our  method. 

3.3.3  Related  Work 

[CL00, Lim00] propose a method for simultaneous partitioning, floorplanning and retiming.
Their algorithm GEO effectively interleaves timing analysis and retiming in a recursive
top-down partitioning approach. While our technique shares a common goal, we differ primarily in
that we strive to fully decouple retiming from the task of floorplanning. In [Lim00], it is noted that
retiming and timing analysis can be computationally expensive, and so its application is limited.
Our work attempts to generate a viable floorplan independent of an explicit retiming, hence doing
away with these costly procedures within floorplanning.

3.4  Experiment 

Here we describe an experiment to validate our approach for floorplanning under retiming
constraints.

3.4.1  Synthetic  Benchmarks 

Unfortunately, there are no freely available benchmark designs which reflect the large-scale
system-on-a-chip designs which we are targeting. To remedy this, we have developed a synthetic
benchmark generation technique suitable for the designs which we would like to see. There
have been several attempts at generation of realistic synthetic benchmark circuits [DD96, HRGC98,
SDC99], but all of these methods use particular features which reflect gate-level designs, rather than
the macroblock examples we would like.

We have thus taken the method of [HRGC98] and modified it to reflect our goals. As in
the original method, distributions of various quantitative metrics are extracted from a "seed" design,
and these parameters are used to randomly generate new designs which have similar characteristics.
Our method differs, though, in that the characteristic distributions we considered to be important are
the number of output pins per block, degree of fanout per output, block size, and the size of cycles
in the design. All but the last characteristic are generated directly from the seed design; in order to
generate designs with the correct distribution of cycles, we use a ripup-and-retry technique similar
to [HRGC98]. Algorithm 3.1 shows the algorithm we use for this process.

In assigning block sizes, we first obtain the distribution of block sizes from the seed
design and fit a binomial distribution curve to these statistics. Block sizes for the generated circuits
are generated randomly using this binomial distribution. This curve fitting is done so that the block
sizes for our generated circuits have a "smoother" distribution over the full range of the possible
sizes.

For each block, we assign it a random fanout count based on the distribution of fanouts
from the initial seed design. We do not fit this to a binomial distribution as the fanout counts tend
to be small, discrete values, unlike the area statistics. The target blocks for the fanouts are initially
assigned randomly.

Algorithm 3.1 Synthetic Benchmark Generation
Input: initial seed design and target design size n
Obtain block size, fanout and cycle count statistics from seed
Generate n blocks with random sizes and fanout counts
repeat
    for i = 2 to max-cyclesize do
        if too many cycles of size i then
            Choose an edge e of some cycle of size i
            Randomly change target of e to some other vertex
        else if too few cycles of size i then
            Choose a random vertex u
            Find a path of length i − 1 from u to some vertex v
            Choose a fanout edge of v and change its target to u
until no change

After the initial block generation, the algorithm iterates over all sizes of cycles (i.e. ranging
from 2 to the size of the largest cycle in the design). For each cycle size, a target range is
obtained by using the number of cycles of that size present in the seed design and allowing for up
to 10% variation. If the number of cycles of the given size lies above this range, edges in the graph
are modified to remove a cycle of this size. Likewise, if the number of cycles of the given size lies
below this range, edges in the graph are modified to add a cycle of this size. In either case, the
fanout characteristics of the generated design are not changed.
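The per-size decision in this loop can be sketched as below. The function name and the dictionary representation of cycle counts are hypothetical, and the "up to 10% variation" is interpreted here as the closed range from 90% to 110% of the seed count, which is an assumption.

```python
def cycle_count_actions(seed_counts, current_counts, tolerance=0.10):
    """Map each cycle size to "remove" or "add" when its count leaves the target range.

    seed_counts / current_counts: dict mapping cycle size -> number of cycles.
    Sizes whose count is already inside the range are omitted from the result.
    """
    actions = {}
    for size, target in sorted(seed_counts.items()):
        low, high = target * (1 - tolerance), target * (1 + tolerance)
        have = current_counts.get(size, 0)
        if have > high:
            actions[size] = "remove"   # too many cycles of this size
        elif have < low:
            actions[size] = "add"      # too few cycles of this size
    return actions


seed = {2: 10, 3: 20, 4: 5}
current = {2: 12, 3: 19}
actions = cycle_count_actions(seed, current)
```

With these counts, size 2 has too many cycles, size 3 is within range, and size 4 has too few, so `actions` is `{2: "remove", 4: "add"}`. Raising `tolerance` widens the range, which corresponds to trading fidelity to the seed design for faster termination.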

Although there is no guarantee of convergence with this algorithm, since adding or removing
a cycle by modifying an edge can change other cycles in which that edge participates, we
found in practice this approach worked well enough for our purposes; runs which failed to terminate
in reasonable time could simply be ignored. The 10% variation in the target cycle size distribution
can be increased if faster termination is desired, or decreased if closer fidelity to the original seed
design is a goal.

Note that the process for generating synthetic designs does not account for $S(e)$, the initial
retiming labeling. We generate $S(e)$ for the synthetic circuit by generating a slicing structure
floorplan using recursive mincut partitioning, then assigning registers to edges in proportion to their
corresponding edge lengths in the resulting layout. This was done to approximate how such a system
might be designed in real life; communication overhead between tightly-coupled blocks would
tend to be minimized for performance reasons.

3.4.2  Technology  Assumptions 

An  abstract  top-level  description  of  the  Alpha  21264  processor  was  used  as  the  seed 
design  for  our  experiments  in  this  paper.  Although  we  do  not  possess  an  actual  Alpha  design, 
high-level  architectural  descriptions  provided  information  about  the  basic  blocks  (functional  units, 
caches, etc.) and their interconnectivity, and a chip micrograph provided information about the relative
areas of the blocks [Kes98, Alp00]. A description for this design is shown in Table 3.1. Note
the  intent  here  is  not  to  produce  an  exact  replica  of  the  Alpha  processor,  but  rather  to  provide  an 
abstract  but  generally  realistic  model  of  a  modern  IC  design.  We  believe  that  the  interconnectivity 
between  blocks  for  this  design  will  be  similar  to  the  high-level  designs  we  will  face  in  the  next 
decade. 

Eight benchmarks were synthesized from the Alpha design, ranging from 24 to 32 macroblocks
in the generated designs; the original design had 24 blocks. Areas for the blocks were
scaled so that the final designs would fit on a square die approximately 24 mm per side. Note that the
relationship between distance and delay is dependent on the physical characteristics of the die; here
we need to make some rough estimates for future technology. We assume that the linear propagation
constant for long optimally-buffered interconnect is 16 µm/ps; note that this yields a delay from
corner-to-corner on the die of 3000 ps (6 clock cycles at 2 GHz). We also assume $t(m,n) = 100$ ps as
the input-to-register delay, uniformly for all macroblock inputs.
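As a quick check of these figures, assuming the corner-to-corner distance is measured as Manhattan distance across the 24 mm die and that a 2 GHz clock corresponds to a 500 ps period:

$$D_{corner} = 2 \times 24\,\text{mm} = 48000\,\mu\text{m}, \qquad \frac{48000\,\mu\text{m}}{16\,\mu\text{m/ps}} = 3000\,\text{ps}, \qquad \frac{3000\,\text{ps}}{500\,\text{ps}} = 6 \text{ clock cycles}$$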

3.4.3  Results 

Table  3.2  shows  the  results  obtained  from  the  eight  synthetic  benchmark  designs  generated 
as  described  above.  The  columns  labeled  Normal  indicate  the  layout  results  using  an  ordinary 
floorplanning  technique  which  generates  a  slicing  structure  using  mincut  partitioning.  The  mincut 
heuristic  used  here  emulates  the  wirelength  minimization  goal  typical  of  current  state-of-the-art 
floorplanning tools. The $\phi$ column indicates the minimum clock period obtained using the normal
technique.  The  wlen  column  indicates  the  total  length  of  the  macroblock  interconnects,  estimated 
using  the  half-perimeter  bounding  box  model  for  the  nets.  The  columns  labeled  Cycle  indicate  the 
same  results  using  the  retiming  constraint  edge-weighting  method  described  in  Section  3.3.2.  The 
Improvement column gives the percentage improvement (reduction in clock period or wirelength)
from the normal layout scheme to the cycle-constrained method.


| Design  | Normal $\phi$ (ps) | Normal wlen (mm) | Cycle $\phi$ (ps) | Cycle wlen (mm) | Improvement $\phi$ (%) | Improvement wlen (%) |
|---------|------|------|------|------|-------|-------|
| 1       | 913  | 388  | 609  | 402  | 33.3  | -3.6  |
| 2       | 589  | 544  | 530  | 582  | 10.0  | -7.0  |
| 3       | 584  | 602  | 568  | 633  | 2.7   | -5.1  |
| 4       | 814  | 543  | 675  | 597  | 17.1  | -9.9  |
| 5       | 611  | 398  | 562  | 432  | 8.0   | -8.5  |
| 6       | 657  | 568  | 625  | 623  | 4.9   | -9.7  |
| 7       | 579  | 608  | 768  | 685  | -32.6 | -12.7 |
| 8       | 766  | 807  | 688  | 935  | 10.2  | -15.9 |
| Average | 689  | 557  | 628  | 611  | 8.9   | -9.7  |

Table 3.2: Results for edge-weighted floorplanning using retiming constraints. Normal indicates
unweighted floorplanning results, Cycle indicates edge-weighted floorplanning results, Improvement
indicates the improvement seen. $\phi$ indicates the cycle time of the floorplanned design, wlen
indicates the wirelength.

On  average,  our  edge-weighting  technique  decreases  the  clock  cycle  for  the  floorplanned 
designs by 8.9%, while wirelength increases by 9.7%. The wirelength increase is, of course, expected,
as our optimization goal differs from the traditional wirelength metric. These results show
that  the  wirelength  penalty  is  roughly  commensurate  with  the  performance  improvement. 
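The Improvement columns of Table 3.2 can be recomputed from the Normal and Cycle columns. The short Python sketch below does so, using only the values printed in the table; the variable and function names are illustrative.

```python
normal_phi = [913, 589, 584, 814, 611, 657, 579, 766]    # ps
cycle_phi = [609, 530, 568, 675, 562, 625, 768, 688]     # ps
normal_wlen = [388, 544, 602, 543, 398, 568, 608, 807]   # mm
cycle_wlen = [402, 582, 633, 597, 432, 623, 685, 935]    # mm


def improvement(before, after):
    """Percentage reduction from `before` to `after` (negative means an increase)."""
    return 100.0 * (before - after) / before


def mean(values):
    return sum(values) / len(values)


phi_per_design = [round(improvement(b, a), 1) for b, a in zip(normal_phi, cycle_phi)]
wlen_per_design = [round(improvement(b, a), 1) for b, a in zip(normal_wlen, cycle_wlen)]

# The Average row matches the improvement between the averaged columns ...
phi_avg_row = round(improvement(mean(normal_phi), mean(cycle_phi)), 1)
wlen_avg_row = round(improvement(mean(normal_wlen), mean(cycle_wlen)), 1)

# ... which differs from the plain mean of the per-design percentages.
phi_mean_of_percentages = mean(
    [improvement(b, a) for b, a in zip(normal_phi, cycle_phi)]
)
```

The per-design values reproduce the table, and the 8.9% and -9.7% figures in the Average row are the improvements between the averaged columns. The unweighted mean of the eight per-design clock period improvements is lower, about 6.7%, because of the large regression on Design 7.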

Note  that,  for  all  but  Design  7,  using  the  edge-weighting  heuristics  improves  the  clock 
period  obtained  for  the  design.  For  this  particular  circuit,  we  found  that  the  critical  cycle  (i.e. 
the  cycle  which  constrains  the  clock  period)  is  broken  early  in  the  slicing  tree,  and  hence  overly 
lengthened,  using  our  heuristics.  As  well,  this  critical  cycle  has  no  extra  registers  available  for 
retiming,  which  further  constrains  the  clock  period. 

One difficulty with the heuristic edge weighting presented here is that it depends on enumerating
all simple cycles of the design. The number of cycles is potentially exponential in the
number of edges in the graph. For high-level floorplanning tasks, this may not be a problem, as
there are relatively few macroblocks and hence relatively few edges. Additionally, typical designs
are expected to have fewer cycles than the theoretical worst case. This is demonstrated by the synthetic
benchmarks generated here; these designs only have several thousand cycles in their structural
graphs apiece. However, the potentially large number of cycles may pose a problem for a standard-cell
placement algorithm which must be able to deal with thousands or millions of placeable objects.
One possible way to avoid enumerating all simple cycles is to only consider the fundamental
cycles of the graph, a technique suggested in [SSBSV92]. Since the number of fundamental
cycles is generally much smaller than the number of simple cycles in a graph, this may be an effective
technique for avoiding this problem. However, we would have to adjust our weighting heuristic
somehow, as we currently use the number of simple cycles in which an edge participates as an implicit
part of the weighting function. We address the potential explosion in the number of cycles in
Chapter 4. There, heuristic techniques for placement using physical cycle constraints which do not
require explicit enumeration of all cycles in the design are given.

3.5  Summary 

In this chapter, a new proof for the characterization of valid retimings was given. This
proof motivates a heuristic edge weighting technique for floorplanning which captures the flexibility
of retiming along global interconnects. Experimental results show these heuristics can be used to
yield a tradeoff between design performance and total wirelength.
Chapter 4

Placement Driven By Sequential Timing Analysis

4.1  Introduction 

Chapter 3 presented a heuristic technique for incorporating retiming constraints into a
floorplanning tool. As previously noted, a potential drawback with the proposed heuristic is its
reliance on enumerating all simple cycles in the structural graph representing the design. While
this may be acceptable for floorplanning techniques which deal with relatively small numbers of
objects, for standard cell placement such a heuristic becomes intractable. This chapter presents
two approaches for dealing with placement in the context of sequential optimization which avoid
the problem of cycle enumeration. Experimental results on academic and industrial designs indicate
these techniques yield significant performance benefits for the final designs after sequential
optimization.

4.1.1  Background 

Sequential optimization techniques have the potential to significantly improve the performance,
area, and power consumption of a circuit implementation to a degree that is not achievable
with combinational synthesis methods. The goal is to balance the path delays between registers and
thus to maximize the circuit performance without changing its input/output behavior.

Practical sequential optimization methods of interest are retiming [LS83, LS91] and clock
skew scheduling [Fis90]. Retiming is a structural transformation that moves the registers in a circuit
without changing the positions of the combinational gates. Retiming, although algorithmically
well studied, has gained only limited use because of its impact on the verification flow and the
inability to accurately model the load changes caused by register moves. When applied before
placement, retiming can perform a coarse balancing of path delays using wiring estimates. In-place
retiming is applied after physical placement and incrementally repositions individual registers
based on a precise evaluation of the timing impact including the change of interconnect delays.
In-place retiming is limited to local perturbations of the placement and cannot correct for global
problems.

Clock  skew  scheduling  preserves  the  circuit  structure  by  applying  non-zero  delays  to  the 
register  clocks  —  thus  virtually  moving  them  in  time.  In  recent  years,  clock  skew  scheduling  has 
gained practical acceptance in multiple design flows, typically applied as a post-placement optimization
technique. Implementation strategies vary from clock tree topology construction [HKM+03,
KF99] to clock tree routing algorithms [XD97]. Recent work on multi-domain clock skew scheduling
[RKS03] has demonstrated that even with a very limited number of skew possibilities, almost all
of  the  benefits  can  be  realized;  implementing  a  non-zero  clock  skew  schedule  in  hardware  may  be 
easier  than  commonly  believed.  For  the  purposes  of  this  work,  however,  we  remain  implementation 
independent. 

The optimization potential of retiming and clock skew scheduling is bounded by the critical
cycle, which is among all structural cycles of a circuit the one with the maximum value for
$\phi_{cycle} = \text{total\_delay} / \text{number\_registers}$. If the combinational delays of all paths along this cycle are
perfectly balanced by retiming or clock skew scheduling, then the design can be clocked with a
period $\phi \ge \phi_{cycle}$.

Traditional  timing-driven  placement  (e.g.  [SCK92])  minimizes  the  overall  wire-length 
with  the  additional  constraint  that  no  combinational  path  is  timing  critical,  i.e.  the  sum  of  the  gate 
and interconnect delays on every path between two registers does not exceed the clock period $\phi$.
This notion of timing criticality is confined to the paths between single sets of registers and does
not adequately capture the timing picture of the circuit when register positions can be moved. A
cyclic set of paths may be non-critical with respect to a flexible register clocking even if some of
the individual paths significantly exceed the clock period. Likewise, combinational paths that have
significant  combinational  slack  may  be  part  of  the  critical  cycle.  The  combinational  delay  view  of  a 
design  does  not  reveal  much  information  about  its  true  sequential  criticality  and  may  easily  mislead 
the  placer  with  respect  to  the  overall  optimization  problem. 
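
As a small illustration with hypothetical numbers (not taken from any design in this work), consider two registers $r_1$ and $r_2$ on a cycle, where the path from $r_1$ to $r_2$ has delay 12 and the return path has delay 4. Only setup constraints are considered here. Under zero skew the clock period is set by the longer path, while the cycle bound is the mean delay per register:

$$\phi_{zero\,skew} = \max(12, 4) = 12, \qquad \phi_{cycle} = \frac{12 + 4}{2} = 8.$$

Delaying the clock at $r_2$ by $s = 4$ relative to $r_1$ meets both constraints at $\phi = 8$:

$$12 \le \phi + s = 12, \qquad 4 \le \phi - s = 4.$$

At $\phi = 8$ the first path exceeds the period by 4 and yet the cycle is feasible, while the second path shows a combinational slack of 4 even though it lies on the critical cycle.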

4.1.2 Existing Work

Previous work has addressed the integration of selective sequential optimization techniques
and placement, but these attempts remain incomplete. [CL00, Lim00] presents a partitioning
and floorplanning approach which considers retiming moves for registers on long interconnects.
There, the concept of sequential slack is used to express the sequential mobility of a register. These
slacks are used as weights in the partitioning algorithm to approximate the sequential criticality of
an edge. This work is extended in [CY03] for a multilevel placement algorithm using simulated
annealing. However, as discussed in Section 4.3, the sequential slack may dramatically underestimate
the edge criticality in the presence of multiple critical regions and thus lead to suboptimal
placements; the version of sequential slack computation using the reference point of [CL00] entirely
misses the critical cycle in several of our examples. The first version of our algorithm presented in
Section 4.4.1 builds on this notion of sequential slack. Unlike the work in [CL00], we guarantee that
the most critical cycle will always be identified, since we choose a register which lies on this cycle
as the reference point for the sequential slack computation. The second version of our algorithm
presented in Section 4.4.2 corrects the problem for all potentially critical cycles by introducing
explicit wire-length constraints for circuit loops that are near critical. We use Lagrangian relaxation
to handle these constraints in an analytical placement phase similar to the approach presented
in [SCK92]. Repeating the timing analysis after each placement iteration additionally improves the
odds that such cycles are identified and algorithmic convergence is achieved.

In [YMS03] a budgeting algorithm is presented that computes delay bounds for a traditional
placement algorithm under the assumption that retiming can be applied. In contrast to our
work, this approach separates the budgeting and placement phases and as a result cannot take the
dynamic interaction between placement, wire delays, and sequential optimization into account. A
tight integration of sequential timing and placement is needed to capture this complex interaction.

Work on integrating retiming and placement for field programmable gate arrays is described
in [SB02]. There, the authors also extend the sequential slack computation technique found
in [CL00]. However, they attempt to overcome the shortcomings of [CL00] through a process of
random sampling of reference points. While this is certainly better than choosing a single arbitrary
reference, this will still be ineffective in identifying the critical cycle if the sampling process does
not happen to choose a register which lies on that cycle. [SB02] also demonstrates good results using
a net weighting heuristic similar to our technique. However, given their framework of annealing-based
placement, it is unclear how to efficiently include explicit cycle constraints, described here in
Section 4.4.2. Our use of a quadratic programming-based formulation allows easy addition of these
powerful constraints through Lagrangian relaxation.

4.2  Motivating  Example 

Timing-driven placement makes possible performance improvements over the pure minimization
of wire-length. The portions of interconnect that have been identified as being timing
critical  can  be  given  increased  weight  or  attention  to  further  reduce  their  lengths.  This  objective 
may  come  at  the  expense  of  other  wires,  but  the  slack  available  on  these  non-critical  nets  allows  the 
wire  delay  increase  to  not  affect  the  final  clock  period. 

With  the  availability  of  in-place  retiming  or  clock  skew  scheduling,  the  wires  that  limit 
the  achievable  clock  period  after  sequential  optimization  are  not  necessarily  the  ones  that  limit  the 
clock period beforehand. Figure 4.1 shows a simple sequential circuit. It is clear that without any
changes to the clocking, the path from b to x is period-limiting. However, if the relative arrival of
the  clock  at  registers  a  and  b  can  be  moved  by  1  delay  unit  backward  and  3  delay  units  forward, 
respectively,  the  paths  between  x,  y,  and  z  will  limit  the  period.  The  information  from  static  timing 
analysis  is  essentially  meaningless  when  register  boundaries  can  be  moved. 


Figure  4.1:  An  example  of  two  sequential  cycles.  Combinational  gates  are  labeled  with  their  delays. 
The  lower  set  of  registers  {x,y,z}  forms  the  critical  cycle  and  will  limit  the  clock  period  after 
sequential optimization. However, a static timing-driven tool will incorrectly target the path from b
to x, likely at the expense of other paths.


Experimental  observation  suggests  that  the  failure  of  static  timing  analysis  to  give  an 
accurate  picture  of  the  post-sequential  optimization  timing  can  be  significant  in  industrial  designs. 
Figure 4.2 plots the distribution of combinational path slack above the distribution of true sequential
path slack in one typical industrial design. It is immediately apparent that they offer very different
versions  of  the  timing  criticality  of  the  circuit.  True  sequential  slack,  described  in  more  detail  in 
Section  4.3.3,  measures  the  amount  of  delay  that  can  be  added  to  a  path  before  it  becomes  critical 
during  post-placement  retiming  or  clock  skew  scheduling.  There  is  clearly  more  flexibility  when 
sequential  optimization  techniques  are  considered;  because  slack  can  accumulate  across  multiple 
register  boundaries,  most  of  the  paths  in  this  design  see  more  than  one  clock  period  of  slack. 


Figure  4.2:  Combinational  slack  and  true  sequential  slack  for  design  Ind08.  Combinational  slack 
distribution  shown  on  top,  true  sequential  slack  distribution  shown  on  bottom.  Horizontal  axis 
indicates slack value after optimal clock skew scheduling, in multiples of clock cycle $\phi$. Vertical
axis  shows  number  of  combinational  paths  between  registers. 


The  use  of  inappropriate  timing  information  to  guide  the  placement  process  can  reduce 
the  final  design  performance.  If  we  again  consider  the  example  in  Figure  4.1,  a  decision  to  shorten 
the  path  from  b  to  x  at  the  expense  of  the  path  from  y  to  z  appears  to  be  beneficial  from  a  static 
timing standpoint. However, doing so would actually result in an increase in the clock period after
sequential optimization. Experimental results, shown in Table 4.1, indicate that this effect is
demonstrably present during placement of several industrial designs; timing-driven placement indeed
yields a design with a smaller clock period before sequential optimization, but such placements
have a larger final clock period after sequential optimization, in some cases even worse than what
could be achieved without any timing optimization at all. This research addresses this problem.
Instead of simply trying to correct the results of combinational timing analysis after placement is
finished, however, we attempt to fully exploit the potential of in-place retiming and clock skew
scheduling.

Figure 4.3(a) illustrates a placement generated using a traditional QP placement tool based
on GORDIAN [KSJ88, KSJA91], with the set of paths that are limiting the post-sequential optimization
timing highlighted. Figure 4.3(b) shows the same circuit placed using a timing-driven placer
which uses combinational static timing analysis. Figure 4.3(c) shows again the same circuit placed
using CAPO, a state-of-the-art placement tool developed at UCLA [CKM00].

In these three cases, the tools have done a visibly suboptimal job of minimizing the wire
segments that are timing-critical; many of them nearly cross the width of the die. Even with register
movement through in-place retiming, it is not likely that the severe mislocation of the critical
registers in this particular layout could be fully corrected. In contrast, Figure 4.3(d) is the resulting
placement from our own sequential placement tool. The critical elements are much more localized,
and the contribution of interconnect delay to the clock period is dramatically reduced. In this example,
the integration of sequential timing analysis into the physical design process has significantly
boosted the utility of clock skew scheduling or in-place retiming.

4.3  Sequential  Timing  Analysis 

Given a circuit with timing information, we are interested in the minimum clock period
achievable under sequential optimization to evaluate its fitness, and to characterize the sequential
criticality of each component. There are three phases of this analysis: construction of the sequential
timing graph, identification of the minimum feasible clock period, and the determination of
sequential slacks.

4.3.1  Constructing  The  Sequential  Timing  Graph 

The sequential timing graph $G = (V, E)$ is extracted from a circuit as follows. Take $V$ to
be the registers of the design, together with an additional vertex $v_{ext}$ representing the primary inputs
and outputs. Add an edge $(u, v)$ to $E$ iff there exists a timing path from $u$ to $v$ in the original circuit.


Figure 4.3: Placements for design Ind08: (a) GORDIAN placement, (b) combinational timing-driven
placement, (c) CAPO placement, (d) cycle-constrained placement. Critical cycles shown in black;
standard cells in gray.

Every edge $e = (u, v)$ is labeled with $d(e)$, the maximum delay between $u$ and $v$ in the circuit. During
placement, the estimated wire delays are included in $d(e)$.
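
The construction above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the tuple format of `paths`, the `"v_ext"` label, and the way a single wire-delay estimate is added per register pair are all assumptions made for the example, not part of the implementation described in this work.

```python
def build_sequential_timing_graph(paths, wire_delay=None):
    """Collapse register-to-register timing paths into delay-labeled edges.

    paths: iterable of (u, v, delay) tuples, one per timing path, where u and
        v are register names or "v_ext" for the primary inputs and outputs.
    wire_delay: optional dict mapping (u, v) to an estimated wire delay that
        is added to every path between u and v (a deliberate simplification).
    Returns a dict mapping each edge (u, v) to d(e), its maximum path delay.
    """
    wire_delay = wire_delay or {}
    d = {}
    for u, v, delay in paths:
        total = delay + wire_delay.get((u, v), 0.0)
        if (u, v) not in d or total > d[(u, v)]:
            d[(u, v)] = total
    return d


# Hypothetical circuit: two registers a and b plus the external vertex.
example_paths = [
    ("v_ext", "a", 2.0),
    ("a", "b", 5.0),
    ("a", "b", 7.0),   # a second, slower path between the same registers
    ("b", "a", 3.0),
    ("b", "v_ext", 1.0),
]
graph = build_sequential_timing_graph(example_paths, {("a", "b"): 1.5})
print(graph[("a", "b")])
```

Keeping only the maximum delay per register pair is what makes the graph small: its size depends on the number of registers, not on the number of combinational paths.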

4.3.2  Finding  The  Critical  Cycle 

Finding the critical cycle is equivalent to computing the maximum mean cycle (MMC) for
$G$, which is given by $\max_{\ell \in C} \frac{1}{|\ell|} \sum_{e \in \ell} d(e)$, where $C$ is the set of all cycles in $G$. The MMC is equal
to the minimum clock period which may be obtained using unconstrained clock skew scheduling. In
our  implementation,  we  use  Howard’s  algorithm  [CTCG+98]  to  compute  the  MMC.  This  is  shown 
in  Algorithm  4.1.  [DIG98]  suggests  that  Howard’s  algorithm  is  the  fastest  known  algorithm  for 
computing  MMC.  Our  own  empirical  observations  concur  with  those  of  [DIG98]. 

Algorithm 4.1 Howard’s Algorithm For Computing MMC
1: for all $u \in V$ do {initial guess}
2:     $\pi(u) \leftarrow e$ for some $e = (u,v) \in E$
3: repeat {main loop}
4:     find all cycles $C_\pi$ in $G_\pi = (V, \{\pi(u) : u \in V\} \subseteq E)$
5:     for all $u \in V$ do {in reverse topological order}
6:         if $u = \mathrm{LoopHead}(L_\pi)$ for some $L_\pi \in C_\pi$ then
7:             $\eta(u) \leftarrow \mathrm{MeanCycleTime}(L_\pi)$, $x(u) \leftarrow 0$
8:         else {defined recursively}
9:             $\eta(u) \leftarrow \eta(v)$ where $\pi(u) = (u,v)$
10:            $x(u) \leftarrow x(v) + d(\pi(u)) - \eta(v)$
11:    for all $e = (u,v) \in E$ do {modify $\pi$}
12:        if $\eta(u) < \eta(v)$ then
13:            $\pi(u) \leftarrow e$
14:    if $\pi$ unchanged then
15:        for all $e = (u,v) \in E$ do
16:            if $\eta(u) = \eta(v)$ and $x(u) < x(v) + d(e) - \eta(u)$ then
17:                $\pi(u) \leftarrow e$
18: until no change in $\pi$
19: return $\max_{u \in V} \eta(u)$


The general idea behind Howard’s algorithm is to maintain a small set of edges $\pi$ which
starts as an initial guess of the critical cycle in the graph. The set $\pi$ then has edges added and
removed iteratively to monotonically increase the delays seen at the nodes in its induced subgraph
(i.e. the subgraph obtained by discarding all edges except those edges in $\pi$). When the delays in
the induced subgraph are maximized and can no longer be increased by changing $\pi$, then we know
that $\pi$ must contain the critical cycle, and thus we obtain the MMC. More details and a proof of
correctness of Howard’s algorithm can be found in [CTCG+98].
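
For readers who want something runnable to experiment with, the sketch below computes the MMC with Karp's dynamic-programming algorithm. Note that this is a deliberately simpler substitute and not Howard's algorithm; it runs in $O(|V||E|)$ time and is intended only for small graphs. The function name and the edge-dictionary format are assumptions made for the example.

```python
def max_mean_cycle(vertices, d):
    """Return the maximum mean cycle of a graph, or None if it is acyclic.

    vertices: list of hashable vertex names.
    d: dict mapping each edge (u, v) to its delay d(e).
    """
    n = len(vertices)
    neg_inf = float("-inf")
    # F[k][v]: maximum delay of any walk with exactly k edges ending at v.
    F = [{v: neg_inf for v in vertices} for _ in range(n + 1)]
    for v in vertices:
        F[0][v] = 0.0  # a walk may start at any vertex
    for k in range(1, n + 1):
        for (u, v), delay in d.items():
            if F[k - 1][u] > neg_inf:
                F[k][v] = max(F[k][v], F[k - 1][u] + delay)
    best = None
    for v in vertices:
        if F[n][v] == neg_inf:
            continue  # no walk with n edges ends at v
        # Karp's characterization, written for the maximum mean cycle.
        ratio = min((F[n][v] - F[k][v]) / (n - k)
                    for k in range(n) if F[k][v] > neg_inf)
        if best is None or ratio > best:
            best = ratio
    return best


# Hypothetical graph: a three-register cycle (mean 4) and a two-register
# cycle (mean 3) that share the edge x -> y.
delays = {("x", "y"): 4.0, ("y", "z"): 5.0, ("z", "x"): 3.0, ("y", "x"): 2.0}
mmc = max_mean_cycle(["x", "y", "z"], delays)
print(mmc)
```

Initializing every vertex with a zero-length walk plays the role of a virtual source connected to all vertices, so the graph does not need to be strongly connected.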

Note  that  not  only  is  the  MMC  a  lower  bound  on  the  clock  period  for  the  design,  but  also 
that  it  is  always  possible  to  find  a  clock  skew  schedule  for  the  registers  which  achieves  the  MMC, 
by  linear  programming  duality.  We  therefore  use  the  MMC  as  a  metric  to  evaluate  the  fitness  of  a 
design,  since  we  are  assured  that,  given  the  freedom  of  clock  skew  scheduling,  the  final  clock  period 
will be equal to the MMC.
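
The existence of such a schedule can be checked constructively. The sketch below is an illustration under simplifying assumptions: only setup constraints of the form $s(u) + d(u,v) \le s(v) + \phi$ are modeled, hold constraints and any restriction on the skew of $v_{ext}$ are ignored, and the function name is hypothetical. It uses Bellman-Ford style relaxation to find clock offsets $s$ for a given period, and reports failure when the period is below the MMC.

```python
def skew_schedule(vertices, d, phi, eps=1e-9):
    """Return clock offsets s with s[u] + d(u, v) <= s[v] + phi, or None.

    vertices: list of register names.
    d: dict mapping each edge (u, v) to its delay.
    phi: candidate clock period.
    """
    s = {v: 0.0 for v in vertices}
    for _ in range(len(vertices)):
        changed = False
        for (u, v), delay in d.items():
            need = s[u] + delay - phi  # latest-launch requirement on s[v]
            if need > s[v] + eps:
                s[v] = need
                changed = True
        if not changed:
            return s
    # Still relaxing after |V| passes: some cycle has mean delay above phi.
    return None


# Hypothetical two-register cycle with delays 12 and 4, so the MMC is 8.
cycle = {("a", "b"): 12.0, ("b", "a"): 4.0}
print(skew_schedule(["a", "b"], cycle, 8.0))
print(skew_schedule(["a", "b"], cycle, 7.0))
```

With a period equal to the MMC the relaxation settles on an offset of 4 for register `b`, while any smaller period leaves a positive cycle and no schedule exists.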

4.3.3  Assigning  Sequential  Criticality 

Once the critical cycle has been identified, the relative sequential criticality of the other
vertices can be determined. We use a variant of the sequential timing analysis proposed in [CL00].
There, the concepts of sequential arrival and required times and sequential slack are presented.
Given a target clock period of $\phi$, the sequential arrival and required times at all vertices $v \in V$
with respect to a reference vertex $v_{ref}$ can be computed from equations (4.1) through (4.3) using a
modified version of the Bellman-Ford algorithm.


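$$A_{seq}(v, v_{ref}) = \max_{(u,v) \in E} \left[ A_{seq}(u, v_{ref}) + d((u,v)) - \phi \right] \qquad (4.1)$$

$$R_{seq}(v, v_{ref}) = \min_{(v,w) \in E} \left[ R_{seq}(w, v_{ref}) - d((v,w)) + \phi \right] \qquad (4.2)$$

$$A_{seq}(v_{ref}, v_{ref}) = R_{seq}(v_{ref}, v_{ref}) = 0 \qquad (4.3)$$

The sequential slack is $S_{seq} = R_{seq} - A_{seq}$.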

$A_{seq}$ and $R_{seq}$ represent respectively the earliest and latest relative position in time to which
a register can be moved (by retiming or clock skewing) while still meeting timing with respect to the
reference point. $S_{seq}$ measures the feasible range of temporal positions for $v$ relative to the reference
node. Intuitively, sequential slack represents a metric for quantifying sequential criticality, but we
note that this criticality is only with respect to the given $v_{ref}$. A different choice of $v_{ref}$ will impose
different constraints. Figure 4.4 shows an example where the given choice of $v_{ref}$ gives an incorrect
value for the criticality of other vertices in the graph.
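
The dependence on the reference vertex is easy to reproduce on a small graph. The following sketch evaluates the arrival and required time recurrences by Bellman-Ford style relaxation with edge weights $d(e) - \phi$. The function name and the three-vertex graph are hypothetical, and the code assumes that $\phi$ is at least the MMC so that the relaxation converges.

```python
def sequential_slack(vertices, d, phi, v_ref, eps=1e-9):
    """Return (A, R, S): sequential arrival, required time and slack dicts.

    Assumes phi is at least the maximum mean cycle, so that no cycle has a
    positive total weight d(e) - phi.
    """
    inf = float("inf")
    A = {v: -inf for v in vertices}
    R = {v: inf for v in vertices}
    A[v_ref] = R[v_ref] = 0.0            # both are pinned at the reference
    for _ in range(len(vertices) - 1):   # enough passes for simple paths
        for (u, v), delay in d.items():
            if v != v_ref and A[u] + delay - phi > A[v] + eps:
                A[v] = A[u] + delay - phi    # arrival: max over fan-in
            if u != v_ref and R[v] - delay + phi < R[u] - eps:
                R[u] = R[v] - delay + phi    # required: min over fan-out
    S = {v: R[v] - A[v] for v in vertices}
    return A, R, S


# Hypothetical graph with phi = 10: p and q form a critical cycle, while r
# sits on a longer, non-critical cycle r -> p -> q -> r.
regs = ["r", "p", "q"]
delays = {("r", "p"): 5.0, ("p", "q"): 10.0, ("q", "p"): 10.0, ("q", "r"): 5.0}
_, _, slack_from_r = sequential_slack(regs, delays, 10.0, "r")
_, _, slack_from_p = sequential_slack(regs, delays, 10.0, "p")
print(slack_from_r)
print(slack_from_p)
```

Measured from `r`, both `p` and `q` appear to have a slack of 10 even though they lie on a cycle with no slack at all. Measured from `p`, the slack of `q` is zero, as it should be.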

Figure 4.4: An example register timing graph. Clock period $\phi = 10$. Sequential arrival times $A_{seq}$,
required times $R_{seq}$ and slacks $S_{seq}$ are computed with respect to the shaded vertex $v_{ref}$. The true
sequential slack $S_{true}$ for every vertex is zero; ordinary sequential slack does not offer an accurate
picture of sequential criticality.


We define the true sequential slack as

$$S_{true}(v) = -\max_{(u,v) \in E} \left[ A_{seq}(u, v) + d((u,v)) - \phi \right] \qquad (4.4)$$

$S_{true}(v)$ gives the true sequential flexibility of $v$, in that this value is the maximum delay which can
be added to the outputs of $v$ without violating the cycle time $\phi$. Although the definition is straightforward,
in practice computing $S_{true}$ can be prohibitively expensive; using the above equations, this
is equivalent to the all-pairs longest path problem on the sequential timing graph. Thus we use
the ordinary sequential slack $S_{seq}$ to approximate $S_{true}$ as an estimate of criticality. Note, though,
that the potential difference between $S_{true}$ and $S_{seq}$ can be large; we must choose $v_{ref}$ carefully to
minimize the potential error. [CL00] takes $v_{ref} = v_{ext}$, but this seems to be a poor choice, as it may
completely miss the actual critical cycle. The nodes most likely to impose tight constraints on other
nodes are those that are themselves highly constrained, i.e. nodes on the critical cycle itself. We
thus use $S_{seq}(v, v_{ref}) \mid v_{ref} \in \mathrm{CriticalCycle}(G)$ to measure sequential criticality.
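
To make the cost argument concrete, the sketch below computes the true sequential slack directly from its definition, using a max-plus variant of the Floyd-Warshall recurrence for all-pairs longest paths. It is a brute-force illustration, cubic in the number of registers, and not the method used in this work; the function name and example graph are assumptions, and $\phi$ is assumed to be at least the MMC.

```python
def true_sequential_slack(vertices, d, phi):
    """Return S_true(v) for every vertex via all-pairs longest paths.

    Assumes phi >= maximum mean cycle, so no cycle has positive weight under
    edge weights d(e) - phi. Runs in O(|V|^3) time.
    """
    neg_inf = float("-inf")
    # W[a][b]: longest path weight from a to b; this equals A_seq(b, a).
    W = {a: {b: (0.0 if a == b else neg_inf) for b in vertices}
         for a in vertices}
    for (u, v), delay in d.items():
        if u != v:
            W[u][v] = max(W[u][v], delay - phi)
    for k in vertices:
        for a in vertices:
            for b in vertices:
                if W[a][k] + W[k][b] > W[a][b]:
                    W[a][b] = W[a][k] + W[k][b]
    slack = {}
    for v in vertices:
        # Heaviest cycle through v: leave v, reach u, return on edge (u, v).
        worst = max((W[v][u] + delay - phi
                     for (u, t), delay in d.items() if t == v),
                    default=neg_inf)
        slack[v] = -worst
    return slack


# Same hypothetical graph as before: p and q form the critical cycle.
regs = ["r", "p", "q"]
delays = {("r", "p"): 5.0, ("p", "q"): 10.0, ("q", "p"): 10.0, ("q", "r"): 5.0}
s_true = true_sequential_slack(regs, delays, 10.0)
print(s_true)
```

On this graph the true slack of `p` and `q` is zero regardless of any reference vertex, while `r` keeps a slack of 10.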

4.4  Placement  Driven  By  Sequential  Timing 

We  introduce  a  placement  algorithm  that  uses  sequential  timing  information  to  maximize 
the  potential  of  post-placement  retiming  or  clock  skew  scheduling.  The  general  procedure  outlined 
in  Algorithm  4.2  involves  three  phases:  sequential  timing  analysis,  the  assignment  of  weights  based 
on  sequential  criticality,  and  the  introduction  of  explicit  cycle  constraints.  The  algorithm  is  to  some 
measure  independent  of  the  method  used  to  generate  placements;  the  ability  to  weight  nets  and  to 
include inequality constraints are the only specific requirements.

Algorithm 4.2 Sequential Slack Weighting
1: sequential timing analysis
2: assign net weights $w(i)$
3: partition $P_0 \leftarrow$ all cells
4: while $\exists P\,(|P| > m)$ do {GORDIAN main loop}
5:     solve global constrained QP
6:     bipartition all $P$ where $|P| > m$
7: solve final global constrained QP
8: (optional) do placement with cycle constraints (Algorithm 4.4)
9: legalize placement into rows


We have implemented a modified version of GORDIAN [KSJ88, KSJA91]. In this procedure,
phases of global optimization are interleaved with bipartitioning. A quadratic programming
(QP) problem is constructed to minimize the total weighted quadratic wirelength

$$\sum_{i,j} a_{ij} \left[ (x_i - x_j)^2 + (y_i - y_j)^2 \right]$$

subject to a set of linear constraints. This problem is solved for the entire chip, and the positions
of all cells are updated. Based on this information, the cells in every subregion that contains more
than $m$ members are bipartitioned to minimize the total number of wires across the cut and maintain
reasonably balanced halves.

We utilize two different partitioning techniques. At the topmost levels, where the partitioning
is coarse and the information from the QP solution is less useful to guide the partitioning,
we use hMetis [KAKS97, KK99] to partition the hypergraph without regard to geometry. For finer
divisions, we choose a cut-minimizing spectral partition based on the QP solution, similar to the
techniques described in [TKH88, TK91a].

The coordinates of the center of each subregion are computed and a linear center-of-gravity
(COG) constraint is imposed on its members. The QP is updated to include these new
constraints and the global optimization is repeated. GORDIAN is ideally suited to the requirements
described above; nets can be easily weighted in both the global optimization and bipartitioning
phases, and additional constraints can be imposed on the solution of the QP.
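
The effect of net weights on the global optimization can be seen in a one-dimensional toy version of the problem. The sketch below is not GORDIAN: it omits the COG and other linear constraints and solves the quadratic objective by Gauss-Seidel sweeps rather than a sparse linear solver. The function name, the pad and cell names, and the weights are all hypothetical. Since the objective separates in $x$ and $y$, one coordinate suffices for illustration.

```python
def quadratic_place_1d(movable, fixed, conns, sweeps=500):
    """Minimize the sum of a_ij * (x_i - x_j)**2 over the movable cells.

    movable: list of movable cell names.
    fixed:   dict mapping fixed pad names to coordinates.
    conns:   dict mapping (i, j) pairs to connection weights a_ij.
    Assumes every movable cell is connected, directly or not, to a pad.
    """
    x = dict(fixed)
    x.update({c: 0.0 for c in movable})
    nbrs = {c: [] for c in movable}
    for (i, j), a in conns.items():
        if i in nbrs:
            nbrs[i].append((j, a))
        if j in nbrs:
            nbrs[j].append((i, a))
    for _ in range(sweeps):
        for c in movable:
            total = sum(a for _name, a in nbrs[c])
            # Zero gradient places c at the weighted mean of its neighbors.
            x[c] = sum(a * x[name] for name, a in nbrs[c]) / total
    return {c: x[c] for c in movable}


pads = {"L": 0.0, "R": 12.0}
uniform = {("L", "c1"): 1.0, ("c1", "c2"): 1.0, ("c2", "R"): 1.0}
weighted = dict(uniform)
weighted[("c1", "c2")] = 4.0   # treat the c1-c2 connection as critical
plain = quadratic_place_1d(["c1", "c2"], pads, uniform)
tight = quadratic_place_1d(["c1", "c2"], pads, weighted)
print(plain, tight)
```

With uniform weights the two cells are spaced evenly between the pads. Raising the weight of the connection between them pulls them together at the expense of the two pad connections, which is the mechanism the weighting scheme relies on.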

4.4.1 Sequential Slack Weighting

Each  net  is  assigned  a  weight  proportional  to  its  relative  sequential  criticality.  This  is  done 
to  give  priority  to  minimizing  the  lengths  of  the  most  critical  wires  as  they  are  the  most  likely  ones 
to  limit  the  achievable  clock  period.  After  the  sequential  timing  analysis  described  in  Section  4.3, 
we have a function $S_{seq}$ that gives an approximation of the sequential flexibility at each timing point;
this is the inverse of sequential criticality. We use the following equation to compute the net weight
$w(i)$:

The constants $\beta$ and $\gamma$ are chosen to tune the distribution of weights between the most and least
critical nets. This is then applied to every connection $a_{ij}$, in addition to scaling based on fanout.

This weighting alone is enough to produce layouts with improved sequential timing characteristics,
but its limitations should be recognized. Like their combinational counterparts, sequential
slacks are inherently incompatible. Also, without computing the true sequential slacks, the
problems  described  in  Section  4.3.3  can  also  arise.  Both  of  these  problems  can  be  solved  with  the 
introduction  of  cycle  constraints.  Our  iterative  algorithm  to  handle  these  constraints  helps  ensure 
that  we  catch  all  critical  cycles. 

4.4.2  Explicit  Cycle  Constraints 

Assuming complete flexibility in assigning skew to all registers, for a cycle $\ell$ in the circuit
to satisfy a target clock period $\phi$, we must have

$$t_g(\ell) + t_w(\ell) \le \phi\,|\ell|$$

where $t_g(\ell)$ is total intrinsic gate delay around $\ell$, $t_w(\ell)$ is the total wireload delay around $\ell$,
and $|\ell|$ is the number of registers on $\ell$.

Suppose we have an existing placement $P'$ in which the above constraint is violated. Let
$t'_w(\ell)$ be the wireload delay around $\ell$ for $P'$, and let $d(\ell)$ be the total delay around $\ell$ for $P'$. Then we
have

$$t_g(\ell) = d(\ell) - t'_w(\ell) \quad \text{and} \quad \frac{t_w(\ell)}{t'_w(\ell)} \le \frac{\phi\,|\ell| - d(\ell) + t'_w(\ell)}{t'_w(\ell)} = q(\ell)$$

which defines the wireload delay reduction factor $q(\ell)$ necessary for $\ell$ to have a valid clock skew
schedule for the target period.

Let $(x', y')$ be the locations of the cells for the given placement $P'$. We wish to derive a new
placement $P = (x, y)$ which satisfies the above given delay constraints. As an approximation, we
take the wire delay for a cycle as being proportional to the sum of the squared Euclidean distances
between cells in that cycle. That is,

$$t_w(\ell) \approx \eta \sum_{(u,v) \in \ell} \left[ (x_u - x_v)^2 + (y_u - y_v)^2 \right]$$

where $\eta$ is a constant. Thus the physical placement constraints are

$$\frac{\sum_{(u,v) \in \ell} \left[ (x_u - x_v)^2 + (y_u - y_v)^2 \right]}{\sum_{(u,v) \in \ell} \left[ (x'_u - x'_v)^2 + (y'_u - y'_v)^2 \right]} \le q(\ell) \qquad (4.5)$$

The denominator in inequality (4.5) as well as $q(\ell)$ are completely determined from the given placement
and timing information. Thus inequality (4.5) contains only quadratic terms in $(x, y)$. Also
note that these constraints are convex.

We justify approximating total wire delay with the sum of square Euclidean distances by
our use of an iterative algorithm to solve the constrained system. We aim to make only small changes
to the layout during each iteration, so that any error in this approximation can be subsequently
corrected. Details may be found below.
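
A small numerical sketch may help fix ideas. It assumes that the cycle constraint takes the form "total gate delay plus total wire delay is at most the target period times the number of registers on the cycle", so that the reduction factor is the affordable wire delay divided by the current wire delay. The function names, the coordinates, and the delay values are hypothetical.

```python
def wire_delay_reduction_factor(phi, num_registers, total_delay, wire_delay):
    """Fraction of its current wire delay that a cycle may keep at period phi."""
    gate_delay = total_delay - wire_delay        # intrinsic gate delay
    budget = phi * num_registers - gate_delay    # affordable wire delay
    return budget / wire_delay


def cycle_constraint_met(old_xy, new_xy, cycle_edges, factor):
    """Check the squared-distance form of the constraint for one cycle."""
    def squared_length(xy):
        return sum((xy[u][0] - xy[v][0]) ** 2 + (xy[u][1] - xy[v][1]) ** 2
                   for u, v in cycle_edges)
    return squared_length(new_xy) <= factor * squared_length(old_xy)


# Hypothetical cycle with 3 registers, total delay 36 (of which 16 is wire
# delay) and a target period of 10: the wire delay must shrink to 10.
factor = wire_delay_reduction_factor(10.0, 3, 36.0, 16.0)
edges = [("a", "b"), ("b", "c"), ("c", "a")]
old = {"a": (0, 0), "b": (4, 0), "c": (4, 4)}
shrunk = {"a": (0, 0), "b": (3, 0), "c": (3, 3)}
barely_moved = {"a": (0, 0), "b": (4, 0), "c": (4, 3)}
print(factor)
print(cycle_constraint_met(old, shrunk, edges, factor))
print(cycle_constraint_met(old, barely_moved, edges, factor))
```

A factor at or below zero would indicate that gate delay alone already exceeds the budget, in which case no placement change can satisfy the constraint for that cycle.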

Lagrangian  Relaxation 

To realize the placement constraints, we use Lagrangian relaxation, a standard technique
for converting constrained optimization problems into unconstrained problems. For brevity, we
only present a simplified description of this approach here. More information about Lagrangian
relaxation can be found in [PR02, GT96, SCK92].

Let $f(x, y)$ be the sum of square wirelengths over all wires in the design for the placement
$(x, y)$. Recall that the classical analytic placement formulation is simply the unconstrained problem
$\min_{x,y} f(x, y)$. Our constrained problem is then

$$\min_{x,y} f(x, y) \quad \text{such that} \quad g(x, y) \le 0 \qquad (4.6)$$

where the vector $g$ represents the placement constraints. For each cycle in the design, there is a
single element in $g$ which corresponds to the constraint inequality (4.5) for that cycle. We create
the Lagrangian $L(x, y, \lambda) = f(x, y) + \lambda \cdot g(x, y)$, where $\lambda$ is a vector of Lagrangian multipliers; $\lambda$
can be thought of as “penalties” which serve to increase the value of the cost function whenever a
constraint is violated. The Lagrangian dual problem is

$$\max_{\lambda \ge 0} \min_{x,y} L(x, y, \lambda) \qquad (4.7)$$

Our interest in the dual problem lies in the fact that, for convex problems such as ours,
a solution for (4.7) corresponds directly to a solution for the original problem (4.6). We use the
standard technique of subgradient optimization to solve the dual; see Algorithm 4.3.

Algorithm 4.3 Subgradient Optimization For Lagrangian Dual
1: $\lambda \leftarrow 0$
2: $x, y \leftarrow \arg\min_{x,y} L(x, y, \lambda)$
3: while KKT conditions are not satisfied do
4:     $\lambda \leftarrow \max(0, \lambda + \gamma \cdot g(x, y))$
5:     $x, y \leftarrow \arg\min_{x,y} L(x, y, \lambda)$


Note that for a fixed $\lambda$, $\min_{x,y} L(x, y, \lambda)$ can be solved as an ordinary unconstrained
quadratic program. Subgradient optimization works by starting with an initial arbitrary $\lambda$, solving
the resulting unconstrained QP, then adjusting $\lambda$ based on the violated constraints which are
found. If a constraint is violated, the corresponding penalty in $\lambda$ is increased, so that subsequent
iterations will move to reduce the violations, since the objective function, including the penalty
terms, is to be minimized. Heuristics are available to determine an appropriate step size $\gamma$ to adjust
$\lambda$; e.g. [SCK92, PR02]. The Karush-Kuhn-Tucker (KKT) conditions for stopping the algorithm are
described fully in [PR02]. Roughly speaking, the procedure stops once the penalty multipliers grow
large enough to force all constraint violations to zero.
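
The loop can be traced on a toy problem with a single movable cell. In the sketch below, a cell at position $x$ is tied with unit weights to pads at 0 and 10, and one constraint limits the squared length of the wire to the pad at 10 to at most 9. The problem, the fixed step size, and the simplified stopping test (feasibility plus complementary slackness within a tolerance) are all assumptions chosen for illustration; the Lagrangian is taken as the objective plus multiplier times constraint.

```python
def solve_unconstrained(lam):
    """Minimize x**2 + (1 + lam) * (x - 10)**2 in closed form."""
    return 10.0 * (1.0 + lam) / (2.0 + lam)


def constraint(x, max_len_sq=9.0):
    """g(x) <= 0 when the wire from x to the pad at 10 is short enough."""
    return (x - 10.0) ** 2 - max_len_sq


def subgradient_place(gamma=0.05, tol=1e-9, max_steps=1000):
    """Run multiplier updates until the simplified stopping test holds."""
    lam = 0.0
    x = solve_unconstrained(lam)
    g = constraint(x)
    steps = 0
    # Simplified stopping test: feasibility plus complementary slackness.
    while (g > tol or abs(lam * g) > tol) and steps < max_steps:
        lam = max(0.0, lam + gamma * g)   # raise the penalty if violated
        x = solve_unconstrained(lam)      # re-solve the unconstrained QP
        g = constraint(x)
        steps += 1
    return x, lam


x_opt, lam_opt = subgradient_place()
print(x_opt, lam_opt)
```

For a fixed multiplier the penalty term only changes the weight of the constrained wire from 1 to $1 + \lambda$, so each inner solve is an ordinary weighted quadratic placement. The iteration moves the cell from the unconstrained optimum at 5 toward 7, where the constraint becomes tight, and the multiplier approaches 4/3.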

Of course, the design may have many cycles, and thus there may be many constraints
involved. We propose an iterative technique, given in Algorithm 4.4, which reduces the number of
cycles under consideration by ignoring non-critical cycles.

In each iteration, we add the critical cycles found in the current placement to the constraint
set $S$. A clock period $T$ is chosen which we use as a target period for determining the cycle
constraints for $S$. $T$ is decreased slowly from $T_c$, the feasible clock period for the current placement,
down to $T_f$, the final overall target clock period for the design. A slow adjustment of $T$ helps ensure
that we do not overconstrain the current constraint set $S$ while ignoring other cycles. That is, we do
not wish to “squeeze too hard” on those cycles which are currently critical, as this may cause some
other cycle not under consideration to violate its timing constraint. Also, as noted before, we wish
to perturb the placement only by small amounts, so that any error in our quadratic approximation of
the wire delays can be corrected.

A significant benefit to using the iterative technique proposed in Algorithm 4.4 is that we
are able to correct errors in our estimate of the true sequential slack so that subsequent iterations

Algorithm 4.4 Placement Using Cycle Constraints
1: input: an initial placement
2: $T_c \leftarrow$ current MMC, $S \leftarrow$ {critical cycles}
3: while $T_c > T_f$ do
4:     choose target clock period $T$, $T_f \le T < T_c$
5:     for all cycles $\ell \in S$ do
6:         add cycle constraint for $\ell$ with target $T$
7:     remove all cells in $S$ from COG bins
8:     solve QP with cycle constraints (Algorithm 4.3)
9:     reassign all cells in $S$ to nearest COG
10:    solve QP with cycle constraints (Algorithm 4.3)
11:    $T_c \leftarrow$ current MMC, $S \leftarrow S \cup$ {critical cycles}

may have a better estimate of the sequential flexibility of each gate. Recall from Section 4.3.3 that
we use $S_{seq}(v, v_{ref})$ to approximate $S_{true}(v)$, the true sequential slack, by choosing $v_{ref}$ to be a vertex
on the critical cycle. As the set of critical cycles tends to change with each iteration, this helps to
ensure that we compute $S_{seq}$ with respect to several different choices of $v_{ref}$, so that if we mistakenly
identify a critical vertex as non-critical, we will likely correct the mistake in subsequent iterations.
In contrast, [CL00] always takes $v_{ref} = v_{ext}$, and so has no opportunity to correct such errors.

COG  constraints  are  commonly  used  with  analytic  placement  techniques  to  ensure  that 
the  cells  are  spread  out  relatively  evenly  over  the  entire  die  area.  We  also  wish  our  constrained 
placement  to  be  appropriately  spread  out  over  the  die  area,  but  we  do  not  wish  the  COG  constraints 
to  overconstrain  our  solution.  We  approach  this  problem  using  Steps  7-10  in  Algorithm  4.4,  which 
allows  critical  cells  to  “migrate”  to  appropriate  locations  on  the  die  to  avoid  violation  of  timing 
constraints. 

As  a  practical  point,  we  also  introduce  cycles  which  are  near-critical  during  each  iteration, 
instead  of  only  the  critical  cycles,  to  help  reduce  the  number  of  iterations  performed.  Also,  the 
main loop is terminated whenever either of the constrained QPs indicates that the problem may have
become overconstrained, as no further improvement is possible in such a case.

Our approach shares some similarity with that of [SCK92], which also uses Lagrangian
relaxation in an analytic placement framework to resolve timing constraints. However, there are
several key differences between our work and that of [SCK92]. First, and most important, is that we
deal with the cyclic timing constraints which arise during clock skew scheduling, rather than simply
path constraints. Second, we do not use the analytic placement step itself to perform timing
analysis. The practical effect of this is twofold: our approach allows us to use general, nonlinear (and
nonconvex) wire delay models, and we also do not encounter the degeneracy problems inherent in
the constraints which come from timing analysis, as mentioned in [SCK92]. Finally, we enjoy much
greater computational efficiency, as our Lagrangian function can be seen as simply augmenting the
weights of edges between cells by k. Solving the Lagrangian dual for fixed k requires no more
computation than solving an unconstrained QP for our circuit.

4.4.3  Row  Legalization 

We use a greedy approach to detailed placement and legalization of the standard cells into
rows. Such an approach has the prime benefit of speed. However, rather than directly minimizing
wirelength as the search goal, as is typical of most other placement tools, our legalization technique
seeks to minimize the total perturbation of the final placement with respect to the solution
of the last QP obtained during placement (Step 7 of Algorithm 4.2, or Step 10 of Algorithm 4.4).
We do this because we wish our placement to be timing-aware. Since the placement obtained by
solving the QP respects the timing requirements of the circuit, we wish to deviate from such an ideal
solution as little as possible.

Algorithm 4.5 Standard Cell Row Legalization
1: input: a placement solution from QP
2: Sort cells by their y-axis position
3: Place cells into nearest rows with overflow into adjacent rows
4: for all rows R do
5:   Solve LP for R to minimize perturbation from QP
6: while not done do
7:   Sort cells in decreasing order of perturbation
8:   for all cells c do
9:     Move c to minimize perturbation
10:  for all rows R do
11:    Solve LP for R to minimize perturbation


Algorithm 4.5 outlines our legalization technique. We first find an initial legal placement
solution by putting cells into the nearest rows. Cells are spilled into adjacent rows wherever row
capacities are exceeded. Then, for each row, a linear program is formulated and solved to obtain
the placement of each cell in the row. The cost represents the sum of the displacements of each cell
from its ideal location (as given by the QP solution), while constraints are added to forbid overlap
of adjacent cells within the row.
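
The per-row subproblem can be illustrated with a short sketch. This is not the linear program used in our flow: it swaps in the pool-adjacent-violators method with block medians, a standard way to minimize a sum of absolute deviations under an ordering constraint. It assumes the left-to-right cell order is fixed, that displacement is measured as |x - x*| along the row, and that the row is wide enough that its boundaries never bind. The function name `legalize_row` and its arguments are hypothetical.

```python
from statistics import median_low


def legalize_row(targets, widths):
    """Place cells of one row without overlap, close to their QP targets.

    targets: ideal left-edge x positions from the QP, in left-to-right order.
    widths:  cell widths, same order.
    Returns left-edge positions minimizing sum(|x_i - targets[i]|) for this
    fixed order (assumption: row boundaries are ignored).
    """
    # Subtract the cumulative width so "no overlap" becomes "nondecreasing".
    offsets, shifted, prefix = [], [], 0.0
    for x, w in zip(targets, widths):
        offsets.append(prefix)
        shifted.append(x - prefix)
        prefix += w

    # Pool adjacent violators: merge blocks while their medians are out of order.
    blocks = []
    for t in shifted:
        blocks.append([t])
        while len(blocks) > 1 and median_low(blocks[-2]) > median_low(blocks[-1]):
            last = blocks.pop()
            blocks[-1].extend(last)

    # Every cell in a block shares the block median; add the offsets back.
    xs, i = [], 0
    for block in blocks:
        m = median_low(block)
        for _ in block:
            xs.append(m + offsets[i])
            i += 1
    return xs


# Three width-2 cells whose QP targets overlap heavily.
print(legalize_row([5, 6, 6], [2, 2, 2]))
```

In this example the three cells end up abutted, with the middle cell left at its QP location and a total displacement of 3 units.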

Once  the  initial  legalized  solution  is  found,  we  then  proceed  to  make  individual  cell  moves 
to  improve  the  solution  greedily.  The  cells  are  sorted  in  order  of  descending  perturbation  from  the 
QP  solution;  this  helps  ensure  that  cells  which  stand  the  most  to  gain  are  moved  first.  For  each  cell, 
we  determine  a  legal  nonoverlapping  placement  location  which  minimizes  the  perturbation  from 
the  QP  solution,  and  move  the  cell  to  that  location.  After  all  cells  are  moved  in  this  fashion,  we 
compact  all  rows  using  the  same  linear  programming  technique  used  to  obtain  the  initial  solution. 
This process is iterative; empirically we find that only a small fixed number of iterations is required
before most cells find a stable location.


4.5  Experiments 

We ran our design flow on a set of thirteen industrial benchmark circuits as well as fourteen
synchronous designs freely available for academic use, including the largest designs from the
ISCAS89 benchmark suite. The academic circuits were technology mapped using an industrial
synthesis tool into a standard cell library chosen arbitrarily from the industrial benchmarks.

The industrial libraries provided with the designs used interpolated lookup-table based
models to characterize the cells. Both capacitive load and slew rate dependencies were incorporated
in our timing model. The design technology files gave the electrical characterization for the wires;
in all cases, we assumed the use of metal layer 3 for routing. We used the half-perimeter bounding
box metric as our estimate of the wirelength, noting that our algorithms are actually independent of
the wireload estimation technique used, unlike other works, e.g. [SCK92].

Currently, our placement tool can only handle single-row cells, so for the purpose of our
experiments, it was necessary to convert larger circuit elements to single-row instances. Double-row
cells were given a different aspect ratio, keeping the same area. Large macros were given an
arbitrary size so as to fit in a single row. I/O pads were assigned randomly around the die perimeter.

Limitations in our timing analysis tool required some design changes to be made. Transparent
latches were treated as ordinary registers, and combinational cycles were broken arbitrarily.
Some hard macros did not have timing information associated with them, so for the purpose of
timing analysis hard macros were treated as if they were I/Os for the overall circuit. Some designs
used multiple clock domains. As we had no additional information regarding the relative phases and
clock frequencies of these domains, we uniformly regarded the circuits as having only a single clock domain.
We note, however, that the techniques described in this paper can be easily extended to multiple
clock domains.

Experimental results are shown in Table 4.1. The Size column indicates the number of
placed instances for the design. The NW MMC column gives the MMC for the design when no
wireload is taken into consideration. This is the minimum feasible clock period and serves as a
lower bound on the post-placement timing; the MMC of any real placement, with non-zero wirelengths,
will be greater. The REG MMC column shows the MMC achieved for a completely placed design
using our placement flow with equal weights attached to the wires; effectively, this is a placement
tool similar to GORDIAN. The COM MMC column shows the MMC achieved after placement
using a combinational slack-based weighting function for the nets.

The  SEQ  MMC  column  indicates  the  MMC  achieved  after  placement  using  the  sequential 
slack-based  weighting  for  the  nets  as  described  in  Section  4.4.1.  The  percentage  figure  indicates  the 
reduction  in  wire  delay  for  the  SEQ  MMC  result  compared  with  the  COM  MMC  result.  We  choose 
this  as  a  better  figure  of  merit  than  the  absolute  reduction  in  clock  period,  since  no  placer  can  ever 
hope  to  reduce  the  clock  period  below  the  no  wireload  MMC.  The  Run  column  indicates  the  run 
time  for  this  algorithm,  in  seconds. 
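
The figure of merit can be made concrete with a small sketch. It assumes that "wire delay" means the part of the achieved MMC above the no-wireload MMC, which is our reading of the text rather than a definition taken from the tool. The function name and the numbers are hypothetical.

```python
def wire_delay_reduction(nw_mmc, base_mmc, new_mmc):
    """Percentage reduction in wire delay of `new_mmc` relative to `base_mmc`.

    Assumption: wire delay = achieved MMC minus the no-wireload MMC (nw_mmc).
    """
    base_wire = base_mmc - nw_mmc
    new_wire = new_mmc - nw_mmc
    if base_wire <= 0:
        raise ValueError("baseline placement has no wire delay to reduce")
    return 100.0 * (base_wire - new_wire) / base_wire


# Hypothetical design: no-wireload MMC 10.0, COM MMC 14.0, SEQ MMC 13.0.
print(wire_delay_reduction(10.0, 14.0, 13.0))   # reduction in wire delay, percent
print(100.0 * (14.0 - 13.0) / 14.0)             # reduction in clock period, percent
```

For these hypothetical values the clock period drops by only about 7%, while a quarter of the wire delay, the only part a placer can influence, has been removed.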

The CYCLE MMC column indicates the MMC achieved after placement using the cycle
constraint technique described in Section 4.4.2, again with the percentage indicating reduction in
wire delay compared to the combinational-weighted technique, and the Run column indicating the
run time in seconds.

We also compare our tool against Capo, a leading-edge placer which focuses on wirelength
minimization [CKM00]. As the two placers have very different objectives, we certainly do
not expect either one to be competitive in the other's problem domain. However, this comparison
quantifies the benefit of using a sequential flexibility-aware placer, rather than choosing a placer
which is best suited for another task. The CAPO MMC column in Table 4.1 shows the MMC obtained
after placement using Capo, the Run column indicates the run time for Capo in seconds, and
the CYvsCA column indicates the percentage improvement in wire delay of our cycle constraint-based
technique compared to the placement from Capo.

We show significant improvement in achievable clock period through application of our
algorithm. For the industrial benchmarks, we achieved an overall improvement in wire delay of
23.5% over a combinational slack-weighted placement technique, and 28.3% improvement over the
results of Capo.


Table 4.1: Results for sequentially-aware placement. Size indicates number of placed instances. NW MMC indicates MMC when no
wireloads are taken into consideration. REG indicates ordinary GORDIAN-style placement, COM indicates combinational slack-weighted
placement, SEQ indicates sequential slack-weighted placement, CYCLE indicates cycle-constrained placement, CAPO indicates CAPO
placement. % indicates wire delay improvement compared to COM, Run indicates run times in seconds. CYvsCA% indicates wire delay
improvement of CYCLE compared to CAPO.

Note that overall the benefits of using the cycle constraints, compared to simply using the
sequential slack-weighting heuristic, were lower for the academic benchmarks than for the industrial
circuits. Our observation is that, for smaller circuits, adding cycle constraints tends to draw the
cells close together, causing significant cell overlap in the QP solution. Such heavy overlap makes
legalization more difficult, and all of the performance gains made from drawing the critical cycles
close together are lost when legalization spreads the overlapping critical cells apart. We have added
some simple heuristics to halt the addition of cycle constraints whenever sufficient overlap is seen.
However, more work needs to be done in this area.

We note that such excessive cell overlap tends to happen more often with small designs
rather than large ones. We conjecture that large designs tend to have more I/O pins on their
periphery, which gives rise to more spreading of cells in the QP solution. One may therefore wish to
avoid the use of cycle constraints, and only utilize the sequential slack weighting heuristic for small
designs.

The wirelength-optimizing version of our tool was 10.7% worse in total wirelength compared
to Capo. With the addition of cycle constraints, an 18.8% increase was measured over the
wirelength-optimizing implementation. One certainly expects that a final layout which meets the
timing constraints of the design will have longer wirelength than a layout which is done purely with
minimization of wirelength as a goal. The key feature is that even with this wirelength penalty, the
timing still improves. We also note that our tool currently does very little to control the total
wirelength of the final placed design. As there are many nets in the design which are not critical, there
is much opportunity for us to further reduce wirelength, especially during row legalization. Recall
that our legalization procedure seeks only to minimize the perturbation of the QP solution. For
non-critical sections of the design, it makes sense to ignore the QP solution and focus on the traditional
approach of wirelength minimization instead. Additionally, more modern partitioning techniques,
such as those found in Capo, can replace our existing partitioning algorithm, especially at
the coarsest levels of partitioning where the information provided by the QP solution is limited.

Run times are measured in CPU seconds on a 2 GHz Pentium 4 processor. Although it
may seem that our run times are significantly worse than those of the excellent Capo tool, one must note
that our design flow includes timing analysis and additional physical constraints for performance
optimization, which are completely lacking in Capo.

4.6 Discussion And Future Work


Retiming and clock skew scheduling offer significant opportunities for improving design
performance, but current physical design tools do not yet exploit these techniques to their fullest
potential. By optimizing the placement of the most sequentially critical components, we have
demonstrated that it is possible to significantly improve the post-sequential optimization timing.

Another benefit from sequential-driven placement is that the aggressive reduction of
interconnect length in the most sequentially critical cycles will result in their spatial localization. This
greatly reduces the complexity of the distribution problem for clocks with multiple skews.
Table 4.2 suggests that most registers will not even require any skewing. In a traditional placement,
the relative locations of the most critical combinational paths are unconstrained and can be
uniformly distributed across the chip, thereby necessitating a clock network with tightly
controlled skew across the die. Since the registers with zero or near-zero slack will be grouped, the
effort to accurately distribute multiple clock domains can be concentrated on this region.


Clock Offset      Fraction of Registers
20% or more       0.04
10% to 20%        0.01
5% to 10%         0.01
2% to 5%          0.01
-2% to 2%         0.90
-2% to -5%        0.00
-5% to -10%       0.00
-10% to -20%      0.01
-20% or less      0.03

Table 4.2: Clock skews necessary to implement MMC for cycle-constrained placement across
academic designs. Clock offset is given as a percentage of the final clock period.

Much can be done to improve our current tool. As noted before, addressing total
wirelength is an important consideration and we have already formulated several ways to approach this
challenge. The weighting of nets by their criticality is admittedly somewhat ad hoc, so we may
use the ideas of [TK91b] to provide a stronger mathematical basis for our weighting function.
Implementation improvements will allow us to run on larger designs and to present more datapoints
by which to judge our work; runtime can also be improved. Future work includes addressing the
actual implementation details of retiming and clock skew scheduling, adding in-place resynthesis
optimizations during the placement procedure, and extending the sequentially-aware timing model
to other aspects of the synthesis flow.


4.7  Summary 


In  this  chapter,  two  techniques  for  incorporating  sequential  timing  analysis  in  a  placement 
framework  were  shown.  One  uses  a  heuristic  net  weighting  technique,  while  the  other  uses  explicit 
cycle constraints in a Lagrangian relaxation framework. These techniques yield considerable
improvement in final design performance compared to conventional placement techniques.

Chapter 5

Floorplanning Of Asynchronously-Communicating Systems

5.1  Motivation 

The problem of timing closure in deep submicron design has motivated recent research
in the area of what we call asynchronously-communicating systems. Purely synchronous designs
generally cannot tolerate arbitrary insertion of registers, as the delaying of signals for even a single
clock cycle may change the functionality of the circuit. However, it is possible to design systems
where the correct behavior of the circuit is independent of the delays for at least some of the
communication between circuit elements. This has an advantage in making the design process simpler,
as the designer can then ignore the delay for such interconnects and still maintain correct behavior
of the final circuit. This is especially true for today's design flows, where the logical functionality
is often designed far in advance of the physical implementation. The early-stage logic designer's
task is greatly simplified by being able to ignore the additional delays which will be incurred in the
physical implementation.

In the literature, asynchronously-communicating systems are known by various names.
For instance, they are called latency-insensitive systems in [C+99, CSV00] and wire pipelined
systems in [CM05]. To avoid varying terminology, we will use the moniker
asynchronously-communicating uniformly throughout this work.

We note that asynchronously-communicating systems need not be implemented in a strictly
asynchronous design methodology. The communication between circuit elements can be
implemented with synchronous logic using handshaking techniques [C+99]. However, the fundamental
nature of the communication remains asynchronous even with such a synchronous implementation;
correct operation of the circuit is guaranteed regardless of the actual delay of the communication
channel.

The use of asynchronously-communicating design techniques poses a different challenge
for retiming and physical design. In contrast to the purely synchronous case, we have extra
design flexibility, since the length of the wires for asynchronously-communicating interconnects is no
longer bounded by the clock cycle time of the design. Such wires can generally be made longer
than they could be made in a purely synchronous system. However, the extra delay incurred by
extending these wires does impact the performance of the design. Essentially, the use of
asynchronously-communicating design techniques relaxes the timing constraints into a timing cost instead.

5.2  System  Performance  Estimation 

Because of the effects of wire delay on performance, we must utilize techniques to estimate
the performance of our system in order to predict the impact of our physical design. Most
straightforward  is  cycle-true  discrete  event  simulation,  which  simply  takes  the  system  and  a  set  of 
typical  inputs  and  simulates  the  system  running  on  the  given  input.  The  drawback  of  this  technique 
is  that  it  is  relatively  slow. 

Queuing  theory  has  been  used  to  a  great  extent  in  performance  prediction  of  computer 
systems [All80, BCMP75]. Computation elements are modeled as service centers with queues; data
flow  in  the  system  is  represented  as  “customers”  moving  between  the  queues.  The  appeal  of  using 
queuing  networks  for  performance  estimation  is  that  analytic  solutions  for  system  performance  can 
be  obtained  under  certain  assumptions.  However,  one  primary  assumption  is  that  queue  lengths 
are  allowed  to  grow  unbounded.  This  may  be  a  reasonable  approximation  for  simulation  of  larger 
computer  systems,  where  the  service  centers  represent  objects  such  as  disk  caches  and  computer 
systems  with  large  amounts  of  memory,  but  for  low-level  integrated  circuit  design  where  buffers 
tend  to  be  small  and  of  fixed  size,  the  assumption  of  unbounded  queue  length  is  not  justified. 

Petri nets have often been used in the asynchronous design community for modeling and
performance estimation. An excellent overview of this area can be found in [Mur89]. For simple
Petri nets, analytic solutions for performance estimation can be obtained; however, the systems found
in practice give rise to structure not amenable to a closed-form solution.

5.3  Existing  Work 

Recently, a number of approaches to integrating floorplanning with performance estimation
for asynchronously-communicating systems have been proposed. Both [EMW+04] and
[CM05] propose using a cycle-true discrete event simulation to profile the relative utilizations of
each interconnect, then use net weighting in floorplanning to keep those nets which are used most
often shorter. This approach does not account for the relative criticality of the nets, however. A
highly-utilized net may not have any impact on the overall performance of the system if other nets
are the real limiters of performance. [CM05] mentions this shortcoming in their assumption that the
net utilizations are independent.

[LSLH04]  proposes  a  trajectory-based  approach,  which  uses  a  single  cycle-true  discrete 
event  simulation  to  obtain  a  piecewise-linear  model  of  the  system  performance  as  a  function  of  the 
interconnect  delays.  This  model  is  used  to  guide  the  floorplanning  process.  While  [LSLH04]  cites 
reasonably  good  fidelity  of  the  model  in  their  experiments,  it  is  not  clear  how  well  this  approach  will 
scale.  With  the  increasing  impact  of  interconnect  delay,  floorplanning  will  have  a  greater  effect  on 
the  delays,  and  thus  there  is  an  increasing  need  to  evaluate  the  system  performance  model  at  points 
further  away  from  the  initial  characterization. 

5.4  Our  Approach 

There are two key ingredients to our approach to floorplanning of asynchronously-communicating
systems. The first ingredient is the use of a simplified simulation model for performance
estimation. Use of a simplified model avoids the costly computation incurred with cycle-true simulation.
For this purpose, we choose to use a timed Petri net to model the operation of the macroblocks
to be floorplanned. We use simulation on the Petri net to obtain the performance estimation rather
than seek a likely non-existent closed-form solution. Non-determinism is introduced using weights
on the Petri net transitions which give the relative probabilities of the transitions firing.
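
The weighted, non-deterministic choice between enabled transitions can be sketched as follows. This is a minimal illustration, assuming the weights are plain non-negative numbers stored per transition; the function `choose_transition` and the transition names are hypothetical and not part of our simulator.

```python
import random


def choose_transition(enabled, weights, rng=random):
    """Pick one enabled transition with probability proportional to its weight.

    enabled: iterable of transition names that may currently fire.
    weights: dict mapping every transition name to a non-negative weight.
    rng:     the `random` module or a `random.Random` instance (for seeding).
    """
    names = list(enabled)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Two competing transitions; "hit" should fire about three times as often.
rng = random.Random(0)
weights = {"hit": 3.0, "miss": 1.0}
picks = [choose_transition(["hit", "miss"], weights, rng) for _ in range(10000)]
print(picks.count("hit") / len(picks))
```

Only the transitions that are currently enabled take part in the draw, so a weight has no effect while its transition is waiting for tokens.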

To model the flow of data through the system, a token source (i.e. a transition with no
incoming edges) is used to simulate an input to the system, and a token sink (i.e. a transition with
no outgoing edges) is used to simulate an output. The rate of arrival of tokens at the token sink
becomes the throughput of the system.

The second ingredient in our approach is to use net weighting based on the sensitivity
of the system performance to the interconnect delays. That is, we consider a linearization of the
performance as a function of the interconnect lengths about some given intermediate floorplanning
solution.

Using the Petri net model of our system, we can obtain the sensitivities by labeling the
tokens of the Petri net with two pieces of information: the time of arrival at the place where the
token currently resides, and a list of nonnegative integers, one element per transition, representing the
number of times the token was involved in the firing of each transition. During the simulation of
the Petri net, when a transition fires, the latest arriving token among those removed by the firing is
used to provide the labeling of the tokens added to the system by the firing. This reflects the fact
that it is the latest arriving token which lies on the critical path and is the bottleneck for the system;
all other tokens have freedom to be delayed without affecting system performance. The new tokens
are labeled with the list of firing counts taken from the critical token with the count for the just-fired
transition incremented.

Tokens  arriving  at  the  token  sink  are  thus  labeled  with  the  number  of  times  the  token 
passes  through  the  transitions  which  restrict  the  throughput.  If  we  sum  all  these  figures  for  the 
tokens  arriving  at  the  token  sink,  we  then  obtain  the  relative  sensitivity  of  the  system  to  the  delays 
for  each  transition. 
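
The labeling scheme can be sketched on a toy net. This is a simplified illustration rather than our simulator: it assumes a conflict-free net, fires enabled transitions in round-robin order, labels every transition rather than only the interconnect ones, and models the source with a self-loop place so that it emits one token per delay period (a modeling shortcut that departs from the strict "no incoming edges" definition). All names (`Token`, `simulate`, `summed_counts`, the places and transitions) are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Token:
    time: float = 0.0                           # arrival time at its place
    counts: dict = field(default_factory=dict)  # firing count per transition


def simulate(places, transitions, n_tokens):
    """places: name -> deque of Tokens; transitions: name -> (inputs, outputs, delay).

    Fires enabled transitions round-robin until n_tokens have left through sinks.
    """
    sunk = []
    while len(sunk) < n_tokens:
        fired = False
        for name, (inputs, outputs, delay) in transitions.items():
            if any(not places[p] for p in inputs):
                continue  # some input place is empty: not enabled
            removed = [places[p].popleft() for p in inputs]
            # The latest-arriving removed token is the critical one.
            critical = max(removed, key=lambda tok: tok.time, default=Token())
            counts = dict(critical.counts)
            counts[name] = counts.get(name, 0) + 1
            new = Token(critical.time + delay, counts)
            for p in outputs:
                places[p].append(new)
            if not outputs:
                sunk.append(new)  # transition with no outgoing edges: token sink
            fired = True
        if not fired:
            raise RuntimeError("deadlock: no transition is enabled")
    return sunk


def summed_counts(tokens):
    """Sum the firing counts over all tokens (relative sensitivities)."""
    total = {}
    for tok in tokens:
        for name, c in tok.counts.items():
            total[name] = total.get(name, 0) + c
    return total


# Toy pipeline: src (delay 2) -> a -> t1 (delay 5, one firing at a time) -> b -> sink.
places = {"src_loop": deque([Token()]), "a": deque(),
          "t1_free": deque([Token()]), "b": deque()}
transitions = {
    "src": (["src_loop"], ["src_loop", "a"], 2.0),
    "t1": (["a", "t1_free"], ["t1_free", "b"], 5.0),
    "sink": (["b"], [], 0.0),
}
sunk = simulate(places, transitions, 3)
print([tok.time for tok in sunk])
print(summed_counts(sunk))
```

In this toy net the slower transition `t1` limits the throughput, and its summed count grows fastest, which is the signal that would be turned into a larger net weight.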

To see this, consider the following formalization. Let $S$ be the set of tokens in the system
and $T$ be the set of transitions. Let $g : S \to \mathbb{R}$ be the labeling of arrival times on the tokens, and
let $f : S \times T \to \mathbb{N}$ be the function which defines the labeling of the firing counts on the tokens; that
is, $f(s,t)$ is the number of times token $s$ was associated with transition $t$. Suppose the delays of
the transitions are given as $h : T \to \mathbb{R}$. Then we have the relationship

\[ g(s) = \sum_{t \in T} f(s,t)\, h(t) \]

since the arrival time of a token is simply the sum of all the transition times associated with that
token. The sensitivity of the arrival time with respect to the transition delay of $t$ is then

\[ \frac{\partial g(s)}{\partial h(t)} = f(s,t) \]
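
The labeling rule and the relationship above can be illustrated with a small Python sketch. This is only a sketch under stated assumptions: the `Token` class, the `fire` helper and the delay values are hypothetical names and numbers chosen for illustration, and the initial token is assumed to start at time 0 with all counts at zero.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    arrival: float                               # g(s): arrival time of the token
    counts: dict = field(default_factory=dict)   # f(s, t): firing count per transition

def fire(transition, delay, consumed):
    """Label the token produced when `transition` fires on the `consumed` tokens."""
    critical = max(consumed, key=lambda tok: tok.arrival)   # latest-arriving token
    counts = dict(critical.counts)                           # copy the critical label
    counts[transition] = counts.get(transition, 0) + 1       # count the just-fired transition
    return Token(critical.arrival + delay, counts)

h = {"t1": 2.0, "t2": 3.0}        # hypothetical transition delays h(t)
a = Token(0.0)
b = fire("t1", h["t1"], [a])      # arrives at 2.0
c = fire("t1", h["t1"], [b])      # arrives at 4.0
d = fire("t2", h["t2"], [b, c])   # c is the critical token; arrives at 7.0

# g(s) = sum over t of f(s, t) * h(t)
g_d = sum(n * h[t] for t, n in d.counts.items())
```

Here the label on `d` records two firings of `t1` and one of `t2`, so the sum reproduces the arrival time of `d`, and each count is the sensitivity of that arrival time to the corresponding delay.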


While the set of labelings on the tokens may seem like a lot of data to manipulate, consider
that the number of tokens is relatively small. For the Petri nets representing a macroblock, usually
there will only be a single token representing the internal state of the macroblock in a one-hot
encoding scheme. For interconnects, there will be one token representing the presence of data on that
interconnect. Additionally, we need to label the tokens with only the subset of transitions
associated with the interconnects; transitions internal to the macroblocks can be ignored for the purpose
of sensitivity. Under these assumptions, for a system containing $m$ macroblocks and $n$ interconnects
between macroblocks, we expect there to be at most $m + n$ tokens, each labeled with $n + 1$
integers. As typical high-level floorplanning problems operate on perhaps dozens of macroblocks and
hundreds of interconnects, the size of this data is relatively small.

Algorithm 5.1 shows our proposed floorplanning flow for asynchronously-communicating
systems. We use the same slicing-tree-based floorplanning framework described in Chapter 3. The
key difference is that instead of weighting nets based on the criteria described in Chapter 3, we
use the sensitivities derived from the simulation step.

Algorithm 5.1 Asynchronously-Communicating Floorplanning Flow

1: Simulate unfloorplanned system and obtain net sensitivities
2: Use recursive bipartitioning to obtain slicing floorplan
3: Low-temperature annealing on slicing tree
4: Re-simulate floorplanned system to obtain new performance and net sensitivities
5: repeat
6:     Low-temperature annealing on slicing tree
7:     Re-simulate floorplanned system to obtain new performance and net sensitivities
8: until no improvement


We  use  an  iterative  solution  so  as  to  incorporate  the  changes  in  the  floorplan  solution  to 
the  performance  estimation.  In  other  words,  we  re-linearize  our  performance  estimate  about  the  new 
operating  point.  Termination  of  the  iterations  occurs  when  there  is  no  improvement  in  the  solution. 
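
The control structure of Algorithm 5.1 can be sketched in Python as follows. This is a minimal sketch, not the tool itself: `simulate`, `bipartition` and `anneal` are hypothetical callables standing in for the Petri net simulation, the recursive bipartitioning and the low-temperature annealing steps, and "no improvement" is assumed here to mean that the simulated throughput did not increase.

```python
def floorplan_flow(simulate, bipartition, anneal):
    """Iterate annealing and re-simulation until throughput stops improving.

    simulate(floorplan) -> (throughput, sensitivities); floorplan may be None
    bipartition(sensitivities) -> floorplan
    anneal(floorplan, sensitivities) -> floorplan
    """
    _, weights = simulate(None)           # step 1: unfloorplanned system
    plan = bipartition(weights)           # step 2
    plan = anneal(plan, weights)          # step 3
    best, weights = simulate(plan)        # step 4
    while True:                           # steps 5-8
        candidate = anneal(plan, weights)
        throughput, new_weights = simulate(candidate)
        if throughput <= best:            # no improvement: keep the previous plan
            return plan, best
        plan, best, weights = candidate, throughput, new_weights
```

Each pass through the loop re-derives the sensitivities at the current floorplan, which corresponds to re-linearizing the performance estimate about the new operating point.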

5.5  Experiment 

Due  to  the  proprietary  nature  of  such  information,  we  encountered  difficulty  in  obtaining 
real  floorplanning  design  examples  with  detailed  architectural  descriptions.  To  work  around  this,  we 
constructed  a  system  based  on  the  description  of  the  Intel  Pentium  4  processor  with  hyperthreading 
technology as described in [MBH+02, MPS02].

5.5.1 Model Description

While  not  complete,  the  description  from  [MBH+02]  was  sufficient  to  determine  a  set  of 
macroblocks,  their  rough  functionality  as  a  Petri  net  implementation,  and  the  interconnects.  The 
resulting  design  had  32  macroblocks  and  46  interconnects.  An  illustration  of  the  processor  pipeline 
is  shown  in  Figure  5.1. 

We  chose  the  relative  probabilities  of  the  transitions  in  the  instruction  dispatch  unit  to 
reflect  a  workload  of  10%  floating  point  operations,  80%  integer  operations  and  10%  load/store 
operations.  The  figures  which  should  be  used  here  are  actually  dependent  on  the  workload  to  be 
simulated,  but  we  believe  this  distribution  gives  a  reasonable  workload.  The  token  source  was  set  to 
generate  one  token  (i.e.  one  instruction)  per  clock  cycle,  so  that  the  throughput  of  the  system  would 
be  one  token  per  cycle  if  there  were  no  pipeline  stalls. 

Figure  5.2  illustrates  some  examples  of  the  Petri  net  modules  composing  the  processor 
design.  Figure  5.2(a)  shows  the  Instruction  Fetch  module.  The  place  marked  with  a  token  acts  as  the 
token  source  for  the  overall  design.  The  place  labeled  OUT  TC  represents  the  output  interconnect 
going  to  the  Trace  Cache  (the  level  1  instruction  cache);  this  is  a  signal  indicating  the  instruction 
fetch  was  found  in  the  trace  cache  and  can  be  read  immediately  without  decoding.  The  place  labeled 
OUT TLB represents the output interconnect going to the Translation Lookaside Buffer; this is a
signal  indicating  a  trace  cache  miss  so  that  the  instructions  must  be  loaded  from  the  L2  Instruction 
Cache.  The  numbers  on  the  transitions  before  and  after  the  slash  indicate  the  weighting  given  to 
the  transition  and  the  delay  of  the  transition,  respectively.  For  this  instruction  fetch  unit,  we  assume 
95%  (19/20)  of  instructions  are  hits  from  the  cache  and  can  be  fetched  in  one  clock  cycle,  while 
5%  (1/20)  are  cache  misses.  These  assumptions  are,  of  course,  dependent  on  the  workload  being 
simulated;  if  one  wishes  to  use  different  assumptions,  one  can  change  the  model  appropriately.  In 
either  case,  the  determination  of  whether  the  instruction  resides  in  the  TC  can  be  done  in  a  single 
cycle. 

Edges  marked  with  bubbles  indicate  inhibiting  edges  [Mur89].  These  edges  have  the 
semantics  that  if  a  token  is  at  the  place  connected  to  by  the  edge,  the  corresponding  transition  may 
not  fire.  We  use  this  mechanism  to  prevent  multiple  tokens  from  appearing  at  any  given  place.  This 
effectively  emulates  the  notion  that  a  register  may  only  contain  a  single  value,  and  the  inhibiting 
edge  emulates  the  handshaking  signal  from  the  asynchronous  communication  channel. 

The L2 Instruction Cache itself (including the cache controller) is modeled as shown in
Figure 5.2(b). This is also an example of how conflicts for resources are generally modeled in our
system. The controller accepts cache read requests from the two separate TLBs. The cache
controller must respond to requests from the TLBs as they arrive; however, if both TLBs have pending
requests for data, the cache controller alternates between the two for fairness. Two places (one
containing an initial token, the other directly below that) represent the internal state of the cache
controller; that is, which TLB is to be next serviced in case of a conflict. The topmost and
bottommost transitions labeled 1/8 fire when a token arrives at the input which is to be next served; the
middle two transitions labeled 1/8 correspond to the case where there is a request from the TLB
which was last served, but there is no request from the other TLB. In this latter case the controller
must accept the request even though it breaks the alternation between TLBs.

(a) Front End  (b) Register Rename/Allocation  (c) Execution Engine
Figure 5.1: Example processor pipeline. Modules replicated for hyperthreading support are labeled
with A/B.

(a) Instruction Fetch  (b) L2 Instruction Cache  (c) Instruction Scheduler
Figure 5.2: Example Petri net models for processor modules. Numbers indicate transition
weights/delays.

The delay of 8 on the transitions in the L2 Cache is used to model the delay of fetching
from the cache itself. Again, the value one chooses here depends on the actual cache behavior one
wishes to model.

Figure 5.2(c) shows part of the Instruction Scheduler module, which separates the types of
instructions (integer, floating point and memory access) into separate queues. Input to the scheduler
is represented by the presence of a token at the place marked IN. The relative weights given to the
transitions yield the distribution of instruction types.

Queues make up a significant portion of the Pentium 4 processor design. Figure 5.2(d)
illustrates the Petri net representation for a queue which can hold up to three elements (the queues
used for our example processor were all 64-element queues). The token represents the state of the
queue (the number of elements in the queue). Any transition which removes a token from the input
place increments the state, while adding a token at the output place decrements the state. The delays
of 1 on the increment transitions ensure that the queue can only consume one token from the input
place per clock cycle. The delays of 0 on the decrement transitions allow tokens to be both added
to the queue and removed from the queue in a single clock cycle; since the queues all feed modules
which consume at most one token per clock cycle, the inhibiting edges on the decrement transitions
ensure that only one token can be removed from the queue per clock cycle, as it should be. The
exception to this is the integer unit instruction scheduler, which can issue two integer instructions at
once. Here this is modeled using two possible token consumers in parallel.

The model we use for interconnect is shown in Figure 5.2(e). Here there is a single
transition with an inhibiting edge, representing an asynchronous communications channel of delay
X with end-to-end handshaking. If one wishes to model a pipelined interconnect strategy, several
such stages can be placed in series, and each given a delay of 1.

We chose the factor relating wire length to delay to yield an 8-cycle delay for a wire
connecting opposite corners of a square die of sufficient area to contain the entire Pentium 4 design. For
the simulation runs, 10 million simulated machine cycles were used to derive the net sensitivities.

(b) Sensitivity Weighted
Figure 5.3: Floorplans for example processor.

                         Wirelength   Throughput   Time
Sensitivity Weighting         37254         0.71   3494
Uniform Weighting             18456         0.63    383
Mixed Weighting               23766         0.67   3546

Table 5.1: Results for asynchronously-communicating floorplan experiment. Throughput in
instructions per cycle. Time is execution time in seconds on a 450 MHz Pentium 2 machine.

5.5.2  Experimental  Results 

We compare the results from our asynchronously-communicating system floorplanner
with the same tool with uniform net weighting (i.e. without any simulation or sensitivity
analysis) in Table 5.1. In the latter case there is no need for iteration in the algorithm. Figure 5.3 shows
the floorplan results.

We can see that, as expected, using the net weights based on sensitivities degrades the
wirelength, albeit more substantially than we would like: wirelength roughly doubles (37254 versus
18456) while throughput improves from 0.63 to 0.71 instructions per cycle. We also show results for
a “mixed” weighting scheme, where we take the largest sensitivity among all interconnects and add
that figure to all net weightings. This essentially provides an even weighting between the performance
optimization metric and the wirelength optimization metric. Such a mixed weighting scheme allows the
designer to balance between the wirelength and the system performance as desired.
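
The mixed weighting rule can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `mixed_weights` and the sensitivity values are hypothetical, and net weights are assumed to be stored in a dictionary keyed by net name.

```python
def mixed_weights(sensitivities):
    """Add the largest sensitivity to every net weight."""
    offset = max(sensitivities.values())
    return {net: s + offset for net, s in sensitivities.items()}

sens = {"n1": 0.0, "n2": 3.0, "n3": 1.0}   # hypothetical net sensitivities
weights = mixed_weights(sens)
```

With these values a net with zero sensitivity still receives weight 3.0, so it continues to contribute to the wirelength objective, while the most sensitive net receives twice that weight.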

5.6  Summary 

We have demonstrated a system for floorplanning asynchronously-communicating
systems which takes advantage of simplified modeling to allow performance estimation within an
iterative improvement scheme. We use sensitivities of performance with respect to interconnect
delays as a net weighting approach to optimize for performance.

The  primary  difficulty  with  this  approach  lies  in  obtaining  adequate  architectural  models 
for  the  system.  However,  in  the  real  world,  an  architectural  level  design  is  typically  developed  before 
designers  commence  work  on  a  more  detailed  level.  We  believe  an  abstract  model  can  be  developed 
readily  starting  from  this.  Alternatively,  we  may  seek  techniques  for  developing  an  abstract  model 
from  a  cycle-true  simulation,  possibly  similar  in  nature  to  the  techniques  proposed  in  [LSLH04].  It 
remains to be seen how accurate such an approach would be, however.

Chapter 6

Clock  Skew  Scheduling  Under 
Variability 

6.1  Introduction 

The manufacturing process inevitably introduces differences between the idealized design
generated by the circuit designer and the final manufactured product. A goal of circuit design is to
create a final design which is robust to such process variation. A natural approach to dealing with
this problem is to design conservatively; given estimates as to how large the process variations are,
the designer can consider the worst-case variations, and account for these during the initial design
process.

Process variation affects the problem of clock skew scheduling. Variability of the delays
present in the clock tree distribution network of a synchronous digital circuit can change the timing
of registers, and hence can affect the performance of the design. Conservative design in this case
means that the clock skew scheduling, or assignment of latency to the registers, should account for
the worst-case clock tree variability.

In this chapter we present the problem of clock skew scheduling given the presence of
uncertainty in the delays associated with the clock distribution tree in a digital synchronous circuit.
A model for this problem and a sufficient condition for its optimal solution are presented. This
condition can be used as a stopping criterion for an iterative algorithm for this problem.

6.2 Notation

Let $G = (V_L \cup V_C, E_L \cup E_C)$ be the directed graph representing our circuit, where $V_L$ is
the set of leaf nodes in the clock tree (registers), $V_C$ is the set of internal nodes in the clock tree,
$E_L \subseteq V_L \times V_L$ is the set of edges representing combinational delay paths between the leaf nodes, and
$E_C \subseteq V_C \times (V_L \cup V_C)$ is the set of edges representing the clock distribution from the internal nodes
of the clock tree. All edges are directed in the direction of the signal or clock flow.

Let $\delta_{ij} \in \mathbb{R}$ be the delay associated with each combinational delay edge $(i,j) \in E_L$. Let
$\mu_{ij} \in \mathbb{R}$ be the minimum delay associated with each internal clock tree edge $(i,j) \in E_C$.

Let $T \in \mathbb{R}$ be the clock period.

Let $\varepsilon \in [0,1]$ be the factor of uncertainty in delay in the clock tree. A nominal delay of $d$
corresponds to an actual delay in the range $[(1-\varepsilon)d, (1+\varepsilon)d]$.

We require the clock tree subgraph $G_C = (V_L \cup V_C, E_C)$ be a tree. That is, there is a
distinguished vertex $v_0 \in V_C$ called the root which has no incoming edges, and, for every other
vertex $v \in (V_L \cup V_C) \setminus \{v_0\}$, there is a unique path in $G_C$ from $v_0$ to $v$. Equivalently, every vertex
except the root must have exactly one incoming edge in $E_C$.

For each edge $(i,j) \in E_L$, there is an associated vertex in $V_C$ representing the lowest
common parent between the endpoints of that edge. Take the vertices which appear both in the path
from $v_0$ to $i$ and in the path from $v_0$ to $j$. Among these vertices, the one which is furthest away from
$v_0$ (has a path from $v_0$ with the greatest number of edges) is the lowest common parent of $(i,j)$. Let
$p_{ij} \in V_C$ denote the lowest common parent of $(i,j) \in E_L$. It is easy to show that $p_{ij}$ is unique given
$(i,j)$.
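
The definition of the lowest common parent can be made concrete with a short Python sketch. It assumes the clock tree is stored as a hypothetical `parent` dictionary mapping each non-root vertex to the tail of its unique incoming clock edge, and that the two endpoints are distinct; the vertex names are illustrative only.

```python
def path_from_root(parent, v):
    """Return the list of vertices on the unique path from the root to v."""
    path = [v]
    while v in parent:          # the root has no entry in `parent`
        v = parent[v]
        path.append(v)
    return path[::-1]

def lowest_common_parent(parent, i, j):
    """Deepest vertex shared by the root-to-i and root-to-j paths."""
    common = None
    for a, b in zip(path_from_root(parent, i), path_from_root(parent, j)):
        if a != b:
            break
        common = a
    return common

# Hypothetical clock tree: v0 -> c1 -> {r1, r2}, and v0 -> r3
parent = {"c1": "v0", "r1": "c1", "r2": "c1", "r3": "v0"}
p_12 = lowest_common_parent(parent, "r1", "r2")
p_13 = lowest_common_parent(parent, "r1", "r3")
```

Because both paths start at the root and the tree has unique paths, the shared vertices form a common prefix, so the last matching vertex is the one furthest from the root.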

6.3  Problem  Formulation 

We wish to find latency assignments for all vertices which minimize the clock period
while satisfying the timing requirements for the graph. Ignoring the clock distribution tree and the
uncertainty in delay, this is equivalent to finding $(T,x)$ with $x_i \in \mathbb{R}$, $i \in V_L$ which solves

\[
\begin{aligned}
\min\ & T \\
& x_i + \delta_{ij} - T \le x_j, && (i,j) \in E_L
\end{aligned}
\]

This is recognizable as the LP dual of the ordinary maximum mean cycle problem.

When the clock distribution network is added with uncertainty in the associated delays,
we introduce additional variables $x_i \in \mathbb{R}$, $i \in V_C$, and the problem becomes

\[
\begin{aligned}
\min\ & T \\
& x_i + \varepsilon (x_i - x_{p_{ij}}) + \delta_{ij} - T \le x_j - \varepsilon (x_j - x_{p_{ij}}), && (i,j) \in E_L \\
& x_i + \mu_{ij} \le x_j, && (i,j) \in E_C
\end{aligned}
\]

which can be rewritten as

\[
\begin{aligned}
\min\ & T \\
& -(1+\varepsilon) x_i + (1-\varepsilon) x_j + 2\varepsilon x_{p_{ij}} + T \ge \delta_{ij}, && (i,j) \in E_L \\
& -x_i + x_j \ge \mu_{ij}, && (i,j) \in E_C
\end{aligned}
\tag{6.D}
\]

We denote this problem (6.D) to indicate that this is the dual of the maximum mean cycle
problem with clock tree latencies.

Note that (6.D) has one constraint for every edge in $G$. Given an assignment for $(T,x)$
which is feasible in (6.D), we say an edge is critical iff the corresponding constraint is satisfied
with equality.

Claim 6.1. (6.D) has a feasible solution.

Proof of Claim 6.1. One can easily construct a feasible solution. Assign latencies $x_i$ in topological
order starting from $v_0$, according to the inequalities corresponding to the edges in $E_C$. After all
latencies have been assigned, compute $T$ to satisfy all inequalities corresponding to the edges in
$E_L$. □
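
The construction in this proof can be sketched in Python. This is a minimal sketch under assumptions: the data layout (dictionaries keyed by edge), the function name `feasible_solution` and the example values are hypothetical, the root latency is arbitrarily fixed at 0, and at least one combinational edge is assumed to exist.

```python
def feasible_solution(root, clock_edges, comb_edges, lcp, eps):
    """Build (T, x) feasible in (6.D).

    clock_edges: {(i, j): mu_ij} for E_C; comb_edges: {(i, j): delta_ij} for E_L;
    lcp: {(i, j): p_ij} for E_L; eps: uncertainty factor.
    """
    children = {}
    for (i, j), mu in clock_edges.items():
        children.setdefault(i, []).append((j, mu))
    x = {root: 0.0}
    stack = [root]
    while stack:                          # a parent is always assigned before its children
        i = stack.pop()
        for j, mu in children.get(i, []):
            x[j] = x[i] + mu              # meets -x_i + x_j >= mu_ij with equality
            stack.append(j)
    # Smallest T satisfying every E_L constraint for these latencies.
    T = max(delta + (1 + eps) * x[i] - (1 - eps) * x[j] - 2 * eps * x[lcp[i, j]]
            for (i, j), delta in comb_edges.items())
    return T, x

clock_edges = {("v0", "r1"): 1.0, ("v0", "r2"): 2.0}
comb_edges = {("r1", "r2"): 5.0, ("r2", "r1"): 3.0}
lcp = {("r1", "r2"): "v0", ("r2", "r1"): "v0"}
T, x = feasible_solution("v0", clock_edges, comb_edges, lcp, eps=0.1)
```

The sketch only demonstrates feasibility; the period it returns is not claimed to be optimal in general.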

Claim 6.2. (6.D) has an optimal solution where all latencies $x_i$ are non-negative.

Proof of Claim 6.2. If $(T,x)$ is feasible in (6.D), then $(T, x + \theta \cdot \mathbf{1})$ is also feasible in (6.D), where
$\theta \in \mathbb{R}$ and $\mathbf{1}$ is the vector with all elements equal to 1. Let $(T,x)$ be optimal in (6.D), and choose
$\theta = -\min_i x_i$. □
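
The shift invariance used in this proof can be checked by expanding the left-hand sides of the two constraint families of (6.D) after adding a common offset $\theta$ to every latency. The expansion below is a worked check under the notation of this section, with $\varepsilon$ the uncertainty factor:

```latex
\begin{aligned}
&-(1+\varepsilon)(x_i+\theta) + (1-\varepsilon)(x_j+\theta) + 2\varepsilon(x_{p_{ij}}+\theta) + T \\
&\qquad = -(1+\varepsilon)x_i + (1-\varepsilon)x_j + 2\varepsilon x_{p_{ij}} + T
        + \theta\,\bigl[-(1+\varepsilon) + (1-\varepsilon) + 2\varepsilon\bigr] \\
&\qquad = -(1+\varepsilon)x_i + (1-\varepsilon)x_j + 2\varepsilon x_{p_{ij}} + T, \\[4pt]
&-(x_i+\theta) + (x_j+\theta) = -x_i + x_j .
\end{aligned}
```

Since the bracketed coefficient sum is zero, neither family of constraints changes, and the objective $T$ is untouched by the shift.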

6.4  Fundamental  Theorem  Of  Markov  Chains 

Here  we  state  the  Fundamental  Theorem  Of  Markov  Chains,  which  we  make  use  of  later. 
First,  however,  we  present  some  definitions. 

Definition 6.3 (Stochastic Vector). A vector $p \in \mathbb{R}^m$ is said to be stochastic iff each entry $p_i \in [0,1]$
and $\sum_{i=1}^{m} p_i = 1$.

Definition 6.4 (Column Stochastic Matrix). A matrix $B$ is column stochastic iff every column of
$B$ is stochastic.

Theorem 6.5. (Fundamental Theorem Of Markov Chains) If $B$ is a square column stochastic matrix,
then there exists a stochastic vector $p$ such that $Bp = p$.
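
A numerical illustration of this theorem is given by the Python sketch below, which averages the successive iterates $z, Bz, B^2 z, \dots$ of a stochastic starting vector. The matrix, the starting vector, the iteration count and the helper names are hypothetical choices made for illustration; the sketch approximates a fixed point and is not a proof.

```python
def mat_vec(B, v):
    """Multiply the square matrix B (a list of rows) by the vector v."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in B]

def averaged_fixed_point(B, n=5000):
    """Average z, Bz, ..., B^(n-1) z for a uniform stochastic vector z."""
    m = len(B)
    z = [1.0 / m] * m
    total = [0.0] * m
    for _ in range(n):
        total = [t + zi for t, zi in zip(total, z)]
        z = mat_vec(B, z)
    return [t / n for t in total]

# Hypothetical 3x3 column stochastic matrix (each column sums to 1).
B = [[0.0, 0.5, 0.2],
     [1.0, 0.0, 0.3],
     [0.0, 0.5, 0.5]]
p = averaged_fixed_point(B)
residual = max(abs(a - b) for a, b in zip(mat_vec(B, p), p))
```

The averaged vector stays stochastic, and the residual of `B p = p` shrinks in proportion to `1/n` because the difference telescopes to the first and last iterates only.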

Proof of Theorem 6.5. Given in [Mon]. As this source is no longer generally available, we briefly
reproduce the salient points of the proof here. Let $K$ be the set of all stochastic vectors in $\mathbb{R}^m$. Let $B$
be a square column stochastic matrix. We make the following observations:

Observation 6.6. $K$ is bounded. That is, there exists $M \in \mathbb{R}$, $M > 0$ such that the Euclidean norm
$\|x\| \le M$ for all $x \in K$.

Observation 6.7. $K$ is closed.

Observation 6.8. $K$ is convex.

Observation 6.9. For all $x \in K$, $Bx \in K$.

Lemma 6.10. (Markov-Kakutani Theorem) Let $T$ be an affine transformation and $K$ a non-empty
compact convex subset of $\mathbb{R}^m$. If $T$ maps $K$ into $K$ then $T$ has a fixed point in $K$.

Proof of Lemma 6.10. Choose any $z \in K$ and define $x_n = \frac{1}{n}(z + Tz + T^2 z + \dots + T^{n-1} z)$ where
$T^k = T \circ T^{k-1}$ and $T^1 = T$. By convexity of $K$, $x_n \in K$. By compactness of $K$, there exists a
convergent subsequence $(x_{n_j})_{j=1}^{\infty}$ with limit $x \in K$. Since $(x_{n_j}) \to x$ and $T$ is continuous (since $T$ is
affine), then $(T x_{n_j}) \to Tx$. By the Heine-Borel theorem, $K$ is bounded so there is an $M > 0$ such
that $\|a\| \le M$ for all $a \in K$. Therefore

\[
\begin{aligned}
\|x_n - T x_n\| &= \left\| \tfrac{1}{n}(z + Tz + T^2 z + \dots + T^{n-1} z) - \tfrac{1}{n}(Tz + T^2 z + T^3 z + \dots + T^n z) \right\| \\
&= \tfrac{1}{n} \| z - T^n z \| \\
&\le \tfrac{1}{n} \left( \|z\| + \|T^n z\| \right) \\
&\le \tfrac{1}{n} (M + M) = \tfrac{2M}{n}
\end{aligned}
\]

Replacing $n$ by $n_j$ we get $0 \le \|x_{n_j} - T x_{n_j}\| \le \frac{2M}{n_j}$ and letting $j \to \infty$ we conclude $0 \le \|x - Tx\| \le 0$.
This means $x = Tx$. Thus $x$ is a fixed point of $T$. □

Now let $T$ be defined as the affine transformation $T(x) = Bx$. By Observation 6.9, $T$
maps $K$ into $K$. By Observation 6.6 and Observation 6.7, and using the Heine-Borel Theorem, $K$ is
compact. By Lemma 6.10, $T$ has a fixed point $p \in K$. □

6.5 Theoretical Results

In  this  section,  we  present  a  sufficient  condition  for  optimality  in  the  latency  assignment 
problem  with  uncertainty. 

Theorem 6.11. Let $(T,x)$ be a latency assignment feasible in (6.D). If every vertex in $(V_L \cup V_C)$ has
at least one outgoing edge which is critical, then $(T,x)$ is an optimal solution of (6.D).

Proof of Theorem 6.11. Let $E_Z$ be the set of edges which are critical for (6.D) under the assignment
$(T,x)$. Let $E'_L = E_L \cap E_Z$ and let $E'_C = E_C \cap E_Z$. Consider the following linear program:

\[
\begin{aligned}
\min\ & T' \\
& -(1+\varepsilon) x'_i + (1-\varepsilon) x'_j + 2\varepsilon x'_{p_{ij}} + T' \ge 0, && (i,j) \in E'_L \\
& -x'_i + x'_j \ge 0, && (i,j) \in E'_C
\end{aligned}
\tag{6.RD}
\]

We denote this problem (6.RD) to indicate that this is the restricted dual problem for
(6.D) under the assignment $(T,x)$. The restricted dual has the interpretation that a feasible solution
to (6.RD) indicates a direction in which we can move from $(T,x)$ so as to remain feasible in (6.D).
That is, given a feasible solution $(T',x')$ to (6.RD), we can find some $\theta \in \mathbb{R}$, $\theta > 0$ such that
$(T + \theta T', x + \theta x')$ is feasible in (6.D).

Note that 0 is feasible in (6.RD), so at optimality we must have $T' \le 0$.

Lemma 6.12. $(T,x)$ is not optimal for (6.D) iff the optimal solution for (6.RD) has $T' < 0$.

Proof of Lemma 6.12. Suppose $(T,x)$ is not optimal for (6.D). Let the optimal solution of (6.D) be
$(T^*,x^*)$. Now take $(T',x') = (T^* - T, x^* - x)$. $(T',x')$ must be feasible in (6.RD), and $T' < 0$, so
the optimal solution for (6.RD) has $T' < 0$.

Now suppose the optimal solution for (6.RD) is $(T',x')$, where $T' < 0$. Then we can find
$\theta \in \mathbb{R}$, $\theta > 0$, such that $(T + \theta T', x + \theta x')$ is feasible in (6.D). But $T + \theta T' < T$, so $(T,x)$ is not
optimal for (6.D). □

Consider the following linear program derived from (6.RD):

\[
\begin{aligned}
\min\ & -1 \\
& -(1+\varepsilon) x'_i + (1-\varepsilon) x'_j + 2\varepsilon x'_{p_{ij}} \ge 1, && (i,j) \in E'_L \\
& -x'_i + x'_j \ge 0, && (i,j) \in E'_C
\end{aligned}
\tag{6.RD'}
\]

Lemma 6.13. (6.RD) has $T' < 0$ at optimality iff (6.RD') has a feasible solution.

Proof of Lemma 6.13. Suppose (6.RD) has an optimal solution $(T^*,x^*)$, where $T^* < 0$. Then
$x' = -x^*/T^*$ is feasible in (6.RD').

Now suppose (6.RD') has a feasible solution $x'$. Then $(T' = -1, x')$ is feasible in (6.RD),
so (6.RD) has $T' < 0$ at optimality. □

The dual of (6.RD') is

\[
\begin{aligned}
\max\ & -1 + \sum_{(i,j) \in E'_L} y_{ij} \\
& (1-\varepsilon) \sum_{(i,v) \in E'_L} y_{iv} - (1+\varepsilon) \sum_{(v,j) \in E'_L} y_{vj} + \sum_{(i,v) \in E'_C} z_{iv} = 0, && v \in V_L \\
& \sum_{(i,v) \in E'_C} z_{iv} - \sum_{(v,j) \in E'_C} z_{vj} + 2\varepsilon \sum_{p_{ij} = v,\ (i,j) \in E'_L} y_{ij} = 0, && v \in V_C \\
& y_{ij} \ge 0, && (i,j) \in E'_L \\
& z_{ij} \ge 0, && (i,j) \in E'_C
\end{aligned}
\tag{6.RP'}
\]

Lemma 6.14. (Farkas' Lemma) (6.RD') is infeasible iff (6.RP') has some feasible solution for
which $\sum y_{ij} > 0$.

Proof of Lemma 6.14. Consider the case where (6.RP') has no feasible solution for which $\sum y_{ij} > 0$.
Note that 0 is a feasible point in (6.RP'), so the optimal value of (6.RP') is -1. By duality, the optimal
value of (6.RD') is -1 as well, and hence (6.RD') is feasible.

Now consider the case where (6.RP') has some feasible solution for which $\sum y_{ij} > 0$.
Observe that if $(y,z)$ is feasible in (6.RP'), then for all $\theta \in \mathbb{R}$, $\theta > 0$, $(\theta \cdot y, \theta \cdot z)$ is also feasible in
(6.RP'). Thus the value of (6.RP') is unbounded, so by duality (6.RD') is infeasible. □

Suppose every vertex $v \in V_L \cup V_C$ has at least one outgoing edge which is critical. Take an
arbitrary subset of the critical edges $E''_Z \subseteq E_Z$ so that the graph $G_Z = (V_L \cup V_C, E''_Z)$ has exactly one
outgoing edge for every vertex. Consider the following problem generated by adding constraints to
(6.RP'):

\[
\begin{aligned}
\max\ & \sum_{(i,j) \in E'_L} y_{ij} \\
& (1-\varepsilon) \sum_{(i,v) \in E'_L} y_{iv} - (1+\varepsilon) \sum_{(v,j) \in E'_L} y_{vj} + \sum_{(i,v) \in E'_C} z_{iv} = 0, && v \in V_L \\
& \sum_{(i,v) \in E'_C} z_{iv} - \sum_{(v,j) \in E'_C} z_{vj} + 2\varepsilon \sum_{p_{ij} = v,\ (i,j) \in E'_L} y_{ij} = 0, && v \in V_C \\
& y_{ij} \ge 0, && (i,j) \in E'_L \\
& z_{ij} \ge 0, && (i,j) \in E'_C \\
& y_{ij} = 0, && (i,j) \notin E''_Z \\
& z_{ij} = 0, && (i,j) \notin E''_Z
\end{aligned}
\]

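Since the above problem is a restriction of (6.RP'), any feasible solution for this problem
is also feasible for (6.RP'). Let $E''_L = E'_L \cap E''_Z$ and $E''_C = E'_C \cap E''_Z$, and let $e : (V_L \cup V_C) \to E''_Z$ be
the bijective mapping from the vertices of $G$ to their corresponding outgoing edges in $E''_Z$. Also,
we can perform the substitution $y' = (1+\varepsilon) y$. Simplifying the above problem yields the following
equivalent problem:

\[
\begin{aligned}
\max\ & \frac{1}{1+\varepsilon} \sum_{(i,j) \in E''_L} y'_{ij} \\
& \frac{1-\varepsilon}{1+\varepsilon} \sum_{(i,v) \in E''_L} y'_{iv} - y'_{e(v)} + \sum_{(i,v) \in E''_C} z_{iv} = 0, && v \in V_L \\
& \sum_{(i,v) \in E''_C} z_{iv} - z_{e(v)} + \frac{2\varepsilon}{1+\varepsilon} \sum_{p_{ij} = v,\ (i,j) \in E''_L} y'_{ij} = 0, && v \in V_C \\
& y'_{ij} \ge 0, && (i,j) \in E''_L \\
& z_{ij} \ge 0, && (i,j) \in E''_C
\end{aligned}
\tag{6.RP''}
\]

Note that there is exactly one equation in (6.RP'') corresponding to each vertex in $V_L \cup V_C$.
For every $v \in V_C$, define

\[ R_v = \{ u \in V_C : (u = v) \lor ((u,w) \in E''_C,\ w \in R_v) \} \]

That is, $u \in R_v$ iff there exists a path from $u$ to $v$ whose edges are all contained in $E''_C$.

Lemma 6.15. Given (6.RP''), for all $v \in V_C$,

\[ z_{e(v)} = \frac{2\varepsilon}{1+\varepsilon} \sum_{u \in R_v} \ \sum_{p_{ij} = u,\ (i,j) \in E''_L} y'_{ij} \]

Proof of Lemma 6.15. We use induction on $|R_v|$. If $|R_v| = 1$, then $R_v = \{v\}$, and $v$ must have no
incoming edge in $E''_C$. Rearranging the equality in (6.RP'') corresponding to $v$ gives

\[ z_{e(v)} = \frac{2\varepsilon}{1+\varepsilon} \sum_{p_{ij} = v,\ (i,j) \in E''_L} y'_{ij} \]

as required.

If $|R_v| > 1$, then $v$ must have exactly one incoming edge $(w,v) \in E''_C$. Also, by definition
of $R_v$, $R_w = R_v \setminus \{v\}$. By the induction hypothesis,

\[ z_{e(w)} = \frac{2\varepsilon}{1+\varepsilon} \sum_{u \in R_w} \ \sum_{p_{ij} = u,\ (i,j) \in E''_L} y'_{ij} \]

Rearranging the equality in (6.RP'') corresponding to $v$ yields

\[ z_{e(v)} = z_{e(w)} + \frac{2\varepsilon}{1+\varepsilon} \sum_{p_{ij} = v,\ (i,j) \in E''_L} y'_{ij}
            = \frac{2\varepsilon}{1+\varepsilon} \sum_{u \in R_v} \ \sum_{p_{ij} = u,\ (i,j) \in E''_L} y'_{ij} \]

as required. □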


Now consider only the system of linear equalities in (6.RP''), ignoring the inequality
constraints. We can rewrite this system in matrix form by letting

\[ p = \begin{bmatrix} y' \\ z \end{bmatrix} \]

where $y'_{e(u)}$ and $z_{e(v)}$ lie in the same row in $p$ as the constraint corresponding to vertices $u$ and $v$,
respectively, so that $Ap = 0$, where $A$ is the matrix representing the linear equalities in (6.RP'').

Lemma 6.16. If a stochastic vector $p$ is feasible in (6.RP''), then $\sum y'_{ij} > 0$.

Proof of Lemma 6.16. Suppose $\sum y'_{ij} = 0$. By Lemma 6.15, every $z_{ij}$ can be expressed as the sum
of a subset of the $y'_{ij}$, times a constant. Hence $z_{ij} = 0$ as well, so $p = 0$. But $p$ is stochastic, a
contradiction. Thus some $y'_{ij} > 0$, so $\sum y'_{ij} > 0$. □


Let $B = A + I$, where $I$ is the identity matrix. We now have $Bp - Ip = 0$, or $Bp = p$.

Lemma 6.17. $B$ is a column stochastic matrix.

Proof of Lemma 6.17. The variable $y'_{ij}$, $(i,j) \in E''_L$ appears exactly three times in the linear system
of equalities: once with coefficient $-1$ in the equality corresponding to the vertex $i$, once with
coefficient $\frac{1-\varepsilon}{1+\varepsilon}$ in the equality corresponding to the vertex $j$, and once with coefficient $\frac{2\varepsilon}{1+\varepsilon}$ in the
equality corresponding to the vertex $p_{ij}$. Note that these coefficients sum to zero. The variable
$z_{ij}$, $(i,j) \in E''_C$ appears exactly twice: once with coefficient $-1$ in the equality corresponding to the
vertex $i$, and once with coefficient $+1$ in the equality corresponding to the vertex $j$. Again, these
coefficients sum to zero. Thus the columns of $A$ must each sum to zero, so by construction the
columns of $B$ must each sum to 1. Since $\varepsilon \in [0,1]$, all the elements of $B$ lie in $[0,1]$ as well. □

Now we complete the proof of Theorem 6.11. For the latency assignment $(T,x)$, we can
construct the linear program (6.RP'') given the critical edges. Now from Lemma 6.17 and Theorem 6.5,
we know there exists a feasible solution to (6.RP''). In particular, the fixed point $p = [y'; z]$ of
$Bp = p$ satisfies the equality constraints, and since $p$ is stochastic, it must also satisfy the inequality
constraints $p \ge 0$ of (6.RP'') as well.

By Lemma 6.16, this feasible solution of (6.RP'') must have $\sum y'_{ij} > 0$. But by construction
of (6.RP''), the point $(y = y'/(1+\varepsilon), z)$ must be feasible in (6.RP'), and $\sum y_{ij} > 0$.

By Lemma 6.14, (6.RD') is infeasible, so by Lemma 6.13, the optimal solution of (6.RD)
has $T' = 0$. By Lemma 6.12, $(T,x)$ is optimal for (6.D). □

6.6  Summary 

While we do not propose a particular iterative solution for the latency assignment problem
here, we hope to modify one of the many such algorithms for variation-free optimal latency
assignment to a procedure which accounts for variations. In this respect, Theorem 6.11 becomes valuable
in that it gives a stopping criterion for an iterative solution to the latency assignment problem. If a
given assignment is such that every vertex has at least one outgoing edge which is critical, then we
know an optimal solution has been reached. This can also be used to guide the choice of how to
modify the clock network during each step of the iterative latency assignment algorithm.
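
The stopping criterion can be checked mechanically, as in the following Python sketch. It is a sketch under assumptions: the dictionary-based data layout, the function name `meets_stopping_criterion`, the numerical tolerance and the example circuit are all hypothetical, and the assignment passed in is assumed to be feasible in (6.D) already.

```python
def meets_stopping_criterion(T, x, clock_edges, comb_edges, lcp, eps, tol=1e-9):
    """True if every vertex has at least one critical outgoing edge in (6.D).

    clock_edges: {(i, j): mu_ij} for E_C; comb_edges: {(i, j): delta_ij} for E_L;
    lcp: {(i, j): p_ij} for E_L; x: latency of every vertex; T: clock period.
    """
    has_critical = {v: False for v in x}
    for (i, j), delta in comb_edges.items():
        lhs = -(1 + eps) * x[i] + (1 - eps) * x[j] + 2 * eps * x[lcp[i, j]] + T
        if abs(lhs - delta) <= tol:       # E_L constraint holds with equality
            has_critical[i] = True
    for (i, j), mu in clock_edges.items():
        if abs(x[j] - x[i] - mu) <= tol:  # E_C constraint holds with equality
            has_critical[i] = True
    return all(has_critical.values())

clock_edges = {("v0", "r1"): 1.0, ("v0", "r2"): 2.0}
comb_edges = {("r1", "r2"): 5.0, ("r2", "r1"): 3.0}
lcp = {("r1", "r2"): "v0", ("r2", "r1"): "v0"}
x = {"v0": 0.0, "r1": 1.0, "r2": 2.0}
tight = meets_stopping_criterion(4.3, x, clock_edges, comb_edges, lcp, eps=0.1)
loose = meets_stopping_criterion(5.0, x, clock_edges, comb_edges, lcp, eps=0.1)
```

In this small example the period 4.3 makes both combinational constraints tight, so every vertex has a critical outgoing edge; with period 5.0 the registers have none and the check fails.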

The analysis presented in this chapter assumes a fixed clock tree topology. Of course, a
designer may have flexibility in the clock tree design, in order to minimize the impact of variability
on the final performance. However, treatment of such design techniques is outside the scope of the
work presented here.

Chapter 7

An  Experimental  Nonlinear 
Programming  Technique  For 
Floorplanning 

7.1  Introduction 

In the past, formulation of EDA problems in terms of general nonlinear programs had
limited use. Software packages capable of tackling such problems were quite limited in the size of
the problems which they could handle. However, modern nonlinear program solvers have reached
a state of maturity where they have the potential to be useful in many applications. For example,
LANCELOT [CGT92], a large-scale general-purpose optimization package, has been used
successfully to solve nonlinear circuit optimization problems for designs over 1500 gates [VC99].
With the ability to tackle problems with over 9000 variables and 10000 constraints, it is worth
examining the application of LANCELOT to other EDA problems of similar size.

In this chapter, we describe an experimental technique to use LANCELOT in a floorplanning
context. The floorplanning problem is formulated and an algorithm which uses nonlinear
programming at its core is proposed.

7.2 Problem Formulation

We  consider  the  problem  of  floorplanning  rectangular  soft  modules  (i.e.  modules  with 
fixed  areas  but  whose  aspect  ratios  may  be  modified)  to  minimize  a  linear  combination  of  total  die 
area  and  wirelength. 

Formally, we are given a graph $G = (V, E)$, with $V$ representing the modules to be
floorplanned, and $E$ representing the interconnects. We are also given areas $A_v \in \mathbb{R}$
for all modules $v$, and parameters $\gamma \in \mathbb{R}$, $\alpha_A \in \mathbb{R}$ and
$\alpha_L \in \mathbb{R}$, representing the maximum allowed aspect ratio for any module, the cost
weighting factor for area and the cost weighting factor for wirelength, respectively. The goal is
then to find $W \in \mathbb{R}$, $H \in \mathbb{R}$, the overall width and height of the die,
respectively, and, for each module $v$, module locations $x_v \in \mathbb{R}$, $y_v \in \mathbb{R}$,
widths $w_v \in \mathbb{R}^+$ and heights $h_v \in \mathbb{R}^+$, such that all of the following
conditions hold:

• All modules fit on the die: For all $v \in V$

\[ x_v - \tfrac{1}{2} w_v \ge 0 \quad\text{and}\quad x_v + \tfrac{1}{2} w_v \le W \]
\[ y_v - \tfrac{1}{2} h_v \ge 0 \quad\text{and}\quad y_v + \tfrac{1}{2} h_v \le H \]

• Modules fit in their allocated areas: For all $v \in V$

\[ w_v h_v = A_v \tag{7.1} \]

• No modules overlap: For all $u \in V$, $v \in V$, $u \ne v$

\[ |x_u - x_v| \ge \tfrac{1}{2}(w_u + w_v) \quad\text{or}\quad |y_u - y_v| \ge \tfrac{1}{2}(h_u + h_v) \tag{7.2} \]

• Modules respect the aspect ratio limit: For all $v \in V$

\[ \frac{1}{\gamma} \le w_v / h_v \le \gamma \]

• A weighted sum of total area and wirelength is minimized. Here we use the sum-of-squares
metric for wirelength, measured from the module centers. For the cost function we take

\[ \alpha_A W H + \alpha_L \sum_{(u,v) \in E} \left[ (x_u - x_v)^2 + (y_u - y_v)^2 \right] \]
Note that some simplification is possible; we may eliminate the area constraints (7.1) and
substitute $h_v = A_v / w_v$ in the above. Also note that $W$ and $H$ are completely determined
given the above constraints.
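
To make the formulation concrete, the following Python sketch evaluates the cost function and checks the constraints of this section for a candidate floorplan. It is an illustration only: the dictionary-based data layout, the function names and the tolerance `tol` are assumptions introduced here, and the die dimensions are taken as the smallest values that enclose all modules.

```python
from itertools import combinations


def floorplan_cost(pos, size, edges, alpha_a=1.0, alpha_l=1.0):
    """Cost alpha_A * W * H + alpha_L * (sum of squared center distances).

    pos[v] = (x, y) is the center of module v; size[v] = (w, h).
    W and H are taken as the smallest die enclosing all modules (assumption).
    """
    W = max(pos[v][0] + size[v][0] / 2 for v in pos)
    H = max(pos[v][1] + size[v][1] / 2 for v in pos)
    wirelength = sum((pos[u][0] - pos[v][0]) ** 2 + (pos[u][1] - pos[v][1]) ** 2
                     for u, v in edges)
    return alpha_a * W * H + alpha_l * wirelength


def is_legal(pos, size, area, gamma, tol=1e-9):
    """Check die-boundary, area (7.1), aspect-ratio and non-overlap (7.2) constraints.

    The upper die bounds hold automatically because W and H are derived
    from the module extents in floorplan_cost.
    """
    for v in pos:
        (x, y), (w, h) = pos[v], size[v]
        if x - w / 2 < -tol or y - h / 2 < -tol:   # left / bottom die boundary
            return False
        if abs(w * h - area[v]) > tol:             # area constraint (7.1)
            return False
        if not (1 / gamma - tol <= w / h <= gamma + tol):  # aspect ratio limit
            return False
    for u, v in combinations(pos, 2):              # non-overlap (7.2)
        dx = abs(pos[u][0] - pos[v][0])
        dy = abs(pos[u][1] - pos[v][1])
        separated_x = dx >= (size[u][0] + size[v][0]) / 2 - tol
        separated_y = dy >= (size[u][1] + size[v][1]) / 2 - tol
        if not (separated_x or separated_y):
            return False
    return True
```

For two abutting unit squares centered at (0.5, 0.5) and (1.5, 0.5) joined by one net, the die is 2 by 1 and the squared center distance is 1, so with both weights equal to 1 the cost is 3.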

The floorplanning problem is nonlinear. Moreover, the problem is not smooth: the
non-overlap constraints (7.2) are not continuously differentiable in the region of interest. We
take a two-phase approach to dealing with the non-smoothness of the problem. We first use a smooth
approximation to the floorplanning problem to derive a layout which may contain some violations
of the above constraints, but is close to a legal floorplan. Given this approximate solution, we
apply a legalization procedure which resolves the violations to produce a valid floorplan.

7.3  Existing  Work 

In nearly all existing works in high-level floorplanning, the resolution of the non-overlap
constraints is performed through the generation of topological constraints for the modules. A
common representation of these topological constraints is an HV-constraint graph pair, which is a
pair of acyclic directed graphs, where each graph contains a vertex for each module. In the
H-constraint graph, a directed edge $(u, v)$ represents the constraint
$x_v \ge x_u + \tfrac{1}{2}(w_u + w_v)$ (an H-constraint), while in the V-constraint graph, a
directed edge $(u, v)$ represents the constraint $y_v \ge y_u + \tfrac{1}{2}(h_u + h_v)$ (a
V-constraint). Additionally, for every pair of modules $\{u, v\}$, there is exactly one edge
between those modules (either $(u, v)$ or $(v, u)$) in the HV-constraint graph pair.

Given an HV-constraint graph pair, the non-overlap constraints can be simplified to linear
constraints. Moreover, it is easy to see that any solution to the floorplanning problem corresponds
to at least one (that is, not necessarily unique) HV-constraint graph pair. Based on this
observation, most of the existing floorplanning works approach floorplanning as a discrete search
problem through the space of possible HV-constraint graph pairs, typically using simulated
annealing as the core algorithm. Examples of this approach include the sequence-pair technique
[MFNK96], the bounded sliceline grid [NFMK96], the O-tree representation [GCY99], and the corner
block list [H+00]. Discrete search techniques are problematic to implement in a general nonlinear
programming framework, however.

A number of works address the nonlinearities arising from the area constraints (7.1).
[CK00] uses a linear programming approach to approximate these constraints. [MK98] uses a
transformation technique to construct a convex optimization problem with convex constraints.
Although these techniques are useful in simplifying the area constraints (7.1), they require that
an HV-constraint graph pair be provided as input, and so are not complete solutions to the
floorplanning problem in themselves. That is, they must be augmented with a strategy to search the
space of feasible HV-constraint graph pairs.

7.4  Technique 

Here we present our proposed nonlinear floorplanning technique.

7.4.1  Overall  Flow 

Algorithm 7.1 gives the overall flow for the nonlinear floorplanning technique. The
procedure is iterative, and alternates between two subroutines, PACK and SMOOTH. PACK finds a legal
packing for the modules given their current locations and aspect ratios as a guess as to their
relative placements. SMOOTH formulates a nonlinear program for a smoothed version of the
floorplanning problem and solves it using the current locations as the starting point for
LANCELOT. PACK and SMOOTH are described in Sections 7.4.2 and 7.4.3, respectively.

Algorithm 7.1 Overall Nonlinear Floorplanning Algorithm
1: Input: initial module locations and aspect ratios
2: PACK (Algorithm 7.2)
3: Measure wirelength
4: repeat
5:    SMOOTH (Algorithm 7.3)
6:    PACK (Algorithm 7.2)
7:    Measure wirelength
8: until no improvement


The starting point for the algorithm is arbitrary and need not be legal. We initially choose
module locations randomly and set aspect ratios to 1. A call to PACK is made prior to the first
call to SMOOTH in order to resolve any initial overlaps among the modules. This is in order to
provide good initial positions to LANCELOT for the smoothed floorplanning procedure.
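
The control flow of Algorithm 7.1 can be sketched in a few lines of Python. This is a hedged illustration rather than the implementation used in this work: `pack`, `smooth` and `wirelength` are hypothetical caller-supplied callables standing in for Algorithm 7.2, Algorithm 7.3 and the wirelength measurement, and returning the best packed floorplan seen so far is an assumption, since the algorithm only states that iteration stops when there is no improvement.

```python
def nonlinear_floorplan(initial, pack, smooth, wirelength):
    """Alternate SMOOTH and PACK until the measured wirelength stops improving.

    initial: starting floorplan state (may be illegal).
    pack, smooth: callables mapping a floorplan state to a new state
        (hypothetical stand-ins for Algorithms 7.2 and 7.3).
    wirelength: callable returning a number for a floorplan state.
    Returns the best packed state found and its wirelength.
    """
    best = pack(initial)                  # resolve initial overlaps first
    best_len = wirelength(best)
    while True:
        candidate = pack(smooth(best))    # global move, then legalization
        cand_len = wirelength(candidate)
        if cand_len >= best_len:          # no improvement: stop iterating
            return best, best_len
        best, best_len = candidate, cand_len
```

Passing the two subroutines as arguments keeps the loop independent of the solver; in this chapter both would wrap calls to LANCELOT.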
7.4.2 Packing Algorithm

The packing algorithm is shown in Algorithm 7.2. The goal here is to take the current
module locations and aspect ratios, which may violate some of the floorplanning constraints, and
derive a legal floorplan within the die boundary with minimal perturbation. First, topological
constraints are extracted from the current module locations. Given these additional constraints, a
nonlinear program is formulated for the new constrained floorplanning problem. LANCELOT is
then invoked to solve this problem.

Algorithm 7.2 Packing Algorithm
1: Input: current module locations
2: Extract topological constraints from module locations
3: Generate nonlinear program for packing using topological constraints
4: Invoke LANCELOT to solve nonlinear program


We use the following technique to derive an HV-constraint graph pair from the current
module locations. For every pair of modules $\{u, v\}$, we must choose whether to add an
H-constraint edge or a V-constraint edge between $u$ and $v$, and we wish this choice to fall
naturally from the current locations and aspect ratios of the modules. Assume for the moment that
$x_u < x_v$ and $y_u < y_v$, and that the modules do not overlap in their current placement. We
are particularly interested in the following properties:

• If there exists a horizontal line which passes through both modules $u$ and $v$, we must add
a constraint in the H-constraint graph between the modules. Figure 7.1(a) illustrates this
situation. In this case, adding a V-constraint between the modules does not make sense, as
the current placement would already violate such a constraint.

• If there exists a vertical line which passes through both modules $u$ and $v$, we must add a
constraint in the V-constraint graph between the modules. Figure 7.1(b) illustrates this
situation. This is analogous to the previous property.

• If neither a vertical nor a horizontal line exists which intersects both modules, we have
flexibility to choose to add either an H-constraint or a V-constraint between the modules. See
Figure 7.1(c) for an example.


[Figure 7.1 panels: (a) H-constraint; (b) V-constraint; (c) Either H or V]

Figure 7.1: Examples for topological constraint determination. Desired constraint types indicated.

We propose the following technique for extracting suitable topological constraints between
the modules. Suppose we could scale the modules (that is, increase their sizes) uniformly
and continuously until $u$ and $v$ abut along some edge. The abutting edge indicates a natural
constraint between the modules. If the abutting edge of the scaled modules is horizontal, we add a
vertical constraint between the modules. Likewise, if the abutting edge is vertical, a horizontal
constraint is natural. This procedure captures the properties for the constraints as described
above and can also be easily computed. We formally determine the relative constraint between
modules $u$ and $v$ by computing the following sizing factors:

\[ k^x_{u,v} = \frac{2|x_u - x_v|}{w_u + w_v}, \qquad k^y_{u,v} = \frac{2|y_u - y_v|}{h_u + h_v} \]


We observe the following properties of these sizing factors:

Observation 7.1. If the modules $u$ and $v$ are kept in their current locations but resized by the
sizing factor $k^x_{u,v}$ simultaneously, the resulting floorplan will have the right edge of $u$
collinear with the left edge of $v$; that is,
$x_u + k^x_{u,v}(\tfrac{1}{2} w_u) = x_v - k^x_{u,v}(\tfrac{1}{2} w_v)$. Likewise, if the modules
are scaled by $k^y_{u,v}$, then the top edge of $u$ becomes collinear with the bottom edge of $v$.

Observation 7.2. If the modules $u$ and $v$ are resized by some factor $s$ which is less than
$k^x_{u,v}$, then the right edge of $u$ will lie to the left of the left edge of $v$; that is,
$x_u + s(\tfrac{1}{2} w_u) < x_v - s(\tfrac{1}{2} w_v)$. An analogous situation arises in the
y-direction.

Observation 7.3. The modules $u$ and $v$ can be scaled up to a factor of
$\max(k^x_{u,v}, k^y_{u,v})$ before overlap results between the modules.

Based on these observations, we add an H-constraint between $u$ and $v$ if
$k^x_{u,v} > k^y_{u,v}$; otherwise we add a V-constraint.


Note that the calculation of $k^x_{u,v}$ and $k^y_{u,v}$ does not depend on the previous
assumptions that $x_u < x_v$ and $y_u < y_v$; that is, regardless of the relative locations of the
modules, the comparison of $k^x_{u,v}$ and $k^y_{u,v}$ still generates the proper constraints
between $u$ and $v$. Moreover, we can also remove the assumption that the current module locations
do not overlap. We thus determine the direction of the constraint edge between $u$ and $v$ by the
relative locations of the modules. For instance, if we must add an H-constraint edge, this edge is
chosen to be $(u, v)$ if $x_u < x_v$, else we choose edge $(v, u)$.

In theory, degenerate situations may arise where $x_u = x_v$ and $y_u = y_v$, i.e. the modules are
exactly coincident, or when $k^x_{u,v} = k^y_{u,v}$. In such cases, the physical information
provides no useful guide to the constraint addition process, and we can only choose a constraint
arbitrarily.
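
The constraint extraction rule for a single pair of modules can be sketched as follows. This Python fragment is an illustration under assumptions: the tuple-based inputs, the function name and the return convention are hypothetical, and the tie-breaking in the degenerate cases is one arbitrary choice among many.

```python
def extract_constraint(pu, su, pv, sv):
    """Choose the topological constraint between modules u and v.

    pu, pv: (x, y) module centers; su, sv: (w, h) module sizes.
    Returns (kind, first, second) where kind is "H" or "V" and the directed
    edge (first, second) means 'first' lies to the left of (H) or below (V)
    'second'. The strings "u" and "v" name the two arguments.
    """
    kx = 2 * abs(pu[0] - pv[0]) / (su[0] + sv[0])   # sizing factor in x
    ky = 2 * abs(pu[1] - pv[1]) / (su[1] + sv[1])   # sizing factor in y
    if kx > ky:
        kind, axis = "H", 0
    else:
        # Includes the degenerate tie kx == ky, resolved arbitrarily as V.
        kind, axis = "V", 1
    if pu[axis] < pv[axis]:
        return kind, "u", "v"
    # Also covers equal coordinates, where the direction is arbitrary.
    return kind, "v", "u"
```

For example, with 2 by 2 modules centered at (0, 0) and (3, 0.5), the x sizing factor is 1.5 and the y sizing factor is 0.25, so an H-constraint edge from the first module to the second is chosen. Applying the function to every pair of modules yields one edge per pair, as an HV-constraint graph pair requires.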

Once we have generated an HV-constraint graph pair, the non-overlap constraints are
simplified to become linear inequalities. The resulting system is passed to LANCELOT to be solved.

7.4.3  Smoothed  Floorplan  Algorithm 

The packing algorithm of Section 7.4.2 is limited in that it is useful for resolving violations
of the floorplanning constraints described in Section 7.2, but does not explore larger movements of
modules. We propose using the smoothed floorplanning algorithm here as a method for generating
global movements of modules. This algorithm is shown in Algorithm 7.3.

Algorithm 7.3 Smoothed Floorplan Algorithm
1: Generate smoothed version of non-overlap constraints
2: Invoke LANCELOT to solve nonlinear program


Recall the non-overlap constraints are (7.2):

\[ |x_u - x_v| \ge \tfrac{1}{2}(w_u + w_v) \quad\text{or}\quad |y_u - y_v| \ge \tfrac{1}{2}(h_u + h_v) \]

These constraints are non-smooth due to the disjunctive or. Our goal is to relax these
constraints so that the problem given to LANCELOT is smooth and more readily solvable. Since
we can use the packing algorithm (Algorithm 7.2) to resolve overlaps in the final solution, we do
not need to consider module overlaps at this stage. Note that for a fixed $(x_u, y_u)$, the
constraints define a "keepout" rectangle which constrains the placement of module $v$.

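The non-overlap constraints can be written in terms of the $\ell^\infty$-norm as:

\[ \| Q_{uv} (p_u - p_v) \|_\infty \ge 1 \]

where

\[ p_u = \begin{bmatrix} x_u \\ y_u \end{bmatrix}, \quad
   p_v = \begin{bmatrix} x_v \\ y_v \end{bmatrix}, \quad
   Q_{uv} = \begin{bmatrix} 2/(w_u + w_v) & 0 \\ 0 & 2/(h_u + h_v) \end{bmatrix} \]

The idea we use here is to replace the non-smooth $\ell^\infty$-norm with the smooth
$\ell^2$-norm. This smoothing yields

\[ \| Q_{uv} (p_u - p_v) \|_2 \ge 1 \]

or

\[ \frac{4}{(w_u + w_v)^2} (x_u - x_v)^2 + \frac{4}{(h_u + h_v)^2} (y_u - y_v)^2 \ge 1 \tag{7.3} \]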


The keepout region becomes an ellipse under this relaxation as shown in Figure 7.2.

Figure 7.2: Smoothing of non-overlap constraints. Shading indicates feasible region for module v
relative to module u.

In theory, higher-order norms yield smooth approximations which are closer to the original
non-overlap constraints. In practice, we found the $\ell^2$-norm gave just as good results
using LANCELOT with much faster run times.
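
The effect of the choice of norm can be illustrated numerically. The Python sketch below evaluates the left-hand side of the non-overlap constraint for a pair of modules under a chosen norm; the function name, argument layout and the example values are assumptions made for illustration only.

```python
def keepout_value(pu, su, pv, sv, p=2):
    """Left-hand side of the non-overlap constraint under the l^p norm.

    pu, pv: (x, y) module centers; su, sv: (w, h) module sizes.
    The constraint is satisfied when the returned value is >= 1.
    p = float("inf") gives the exact constraint (7.2); p = 2 gives the
    elliptical relaxation (7.3) before squaring.
    """
    zx = 2 * abs(pu[0] - pv[0]) / (su[0] + sv[0])   # scaled x separation
    zy = 2 * abs(pu[1] - pv[1]) / (su[1] + sv[1])   # scaled y separation
    if p == float("inf"):
        return max(zx, zy)
    return (zx ** p + zy ** p) ** (1.0 / p)
```

For two unit squares with centers offset by 0.8 in both directions, both scaled separations are 0.8. The exact constraint is violated (the squares overlap at a corner), while the $\ell^2$ relaxation accepts the placement; an $\ell^8$ relaxation rejects it, which illustrates how higher-order norms track the rectangular keepout more closely.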

7.5  Experiments 

To illustrate the algorithm, Figure 7.3 shows the results of our nonlinear programming
floorplanner when run on the abstracted Alpha design described in Section 3.4. Figure 7.3(a) shows
a floorplan generated by starting with a random arrangement and applying the packing algorithm.
Figure 7.3(b) shows the result after the first application of the smoothed floorplanning procedure;
modules are shown as ellipses to illustrate the effects of the smoothing on the overlaps of the
modules. Figure 7.3(c) shows the result after packing this placement.
[Figure 7.3 panels: (b) Smoothed (wlen = 17201); (c) Packed (wlen = 25221)]

Figure 7.3: Nonlinear programming floorplan for abstract Alpha 21264 design. Nets are shown
as black lines, and sum-of-squares wirelength metrics are indicated. Visualizations are
independently scaled, so absolute sizes may not be comparable between figures.
Figure 7.3(b) illustrates the prime shortcoming of our floorplanning algorithm. There
tends to be much wasted area as the smoothed constraints fail to spread the modules evenly over the
die area. The poor spreading in the solution to the smooth floorplan carries forward to the packing.
Figure 7.3(c) shows much wasted space in the top right corner of the die. This poor utilization
remains regardless of the weighting factors $\alpha_A$ and $\alpha_L$ chosen for the nonlinear
cost function.

We show results for our nonlinear floorplanning algorithm applied to problems derived
from the ISCAS85 circuit benchmarks [Yan91]. The unmapped designs were considered as
floorplanning instances, with the nodes from the original netlist representing the modules to be
floorplanned. The connectivity between modules was taken as the nets between the nodes, and the
sizes of the modules were taken as proportional to the number of minterms in the original BLIF
representation. We took the maximum aspect ratio parameter $\gamma = 4$. For this experiment, we
arbitrarily took $\alpha_A = \alpha_L = 1$, noting that for real floorplanning problems these
factors would have to be considered more carefully, based on the needs of the designer.

The results of our algorithm are shown in the columns labeled Nonlinear in Table 7.1. We
compare these results to a solution based on simulated annealing using the sequence-pair technique
[MFNK96]. We chose an annealing schedule which resulted in the same run time as was taken
for our nonlinear floorplanning algorithm. This makes the comparison "fair", in the sense that
both techniques would be given the same amount of computing resources. The results from the
annealing method are shown in the columns labeled Annealing in Table 7.1. We note that the
nonlinear algorithm yields consistently better results in both wirelength and area than the
annealing technique when executed for the same amount of time.


Design   Size    Nonlinear               Annealing       Long
                 Wlen    Area    Time    Wlen    Area    Wlen    Area
cm150a     37    1020     472     872    1198     584     994     362
cm151a     21     325     320     426     465     380     298     214
cm152a     12     107      96     128     120     122      98      84
cm42a      17     769     236     376    1125     301     741     212
cm82a      11     103     134     173     153     311     112     118

Table 7.1: Results for nonlinear floorplanning. Size indicates number of modules. Wlen indicates
total wire length using sum-of-squares metric. Area indicates total die area. Time indicates run
time in seconds. Nonlinear indicates the method described in this chapter. Annealing indicates
quick annealing. Long indicates long annealing.


Of course, setting the annealing schedule for the same run time as the nonlinear
floorplanner is an artificial restriction on the annealer. As a further comparison, we used an
annealing schedule which ran for 3 hours on each of the designs. The results of this are shown in
the Long columns of Table 7.1. This yielded consistently better results than our nonlinear
floorplanning. Although our experiments were not exhaustive, we could find no technique to readily
improve the nonlinear floorplanning results.

7.6  Summary 

We have developed a new, nonlinear programming-based technique for floorplanning.
For the same amount of run time, our technique compares very favorably to the annealing-based,
sequence-pair technique proposed by [MFNK96]. However, our method compares poorly against
the annealing-based method when the annealer is given additional run time.

Visual inspection of the floorplans resulting from our nonlinear programming-based
technique suggests that there are problems with the spreading of modules over the die. This could
perhaps be alleviated by borrowing techniques from placement, e.g. the dissection technique from
GORDIAN [KSJ88]. However, it is not clear that further exploration along this avenue will be
fruitful, given the relatively good results obtained from using the annealing technique of
[MFNK96].

Here  we  have  not  explored  the  interaction  of  this  floorplanning  technique  with  retiming 
or  clock  skew  scheduling,  having  limited  our  study  to  wirelength  and  area  optimization  for  now. 
However,  net  weighting  approaches  such  as  those  presented  in  Chapter  3  and  Chapter  4  can  be 
applied  in  a  straightforward  manner.  Whether  other  useful  techniques  for  introducing  timing  or 
sequential  flexibility  in  our  nonlinear  floorplanning  framework  exist  remains  an  open  issue. 
Chapter 8

Conclusions 


The progress of technology has highlighted the need to work towards the goal of integrating
previously disparate design and optimization techniques. To this end, we have shown theoretical
results to justify our approach to the problem of integration of physical design and sequential
optimization, and successfully applied a number of sequentially-aware floorplanning and placement
techniques to various benchmark designs.

There are still a number of open questions and avenues of research left to explore in
this area. While our placement techniques have been validated on industrial design examples, we
do note that more experimental results would be beneficial in demonstrating the benefits of our
floorplanning approaches. Also, there are other design goals, such as routability and congestion,
that can be explicitly addressed to further extend this work. However, our work demonstrates that
the integration of physical design and sequential optimization can be successful for
physically-aware timing optimization of digital circuits in today's wire-dominant technologies.